TY - GEN
T1 - Runtime Protocol Refinement Checking for Distributed Protocol Implementations
AU - Ding, Ding
AU - Wang, Zhanghan
AU - Li, Jinyang
AU - Panda, Aurojit
N1 - Publisher Copyright:
© 2025 by The USENIX Association All Rights Reserved.
PY - 2025
Y1 - 2025
N2 - Despite significant progress in verifying protocols, services that implement distributed protocols (we refer to these as DPIs in what follows), e.g., Chubby or Etcd, can exhibit safety bugs in production deployments. These bugs are often introduced by programmers when converting protocol descriptions into code. This paper introduces Runtime Protocol Refinement Checking (RPRC), a runtime approach for detecting protocol implementation bugs in DPIs. RPRC systems observe a deployed DPI's runtime behavior and notify operators when this behavior evidences a protocol implementation bug, allowing operators to mitigate the bugs impact and developers to fix the bug. We have developed an algorithm for RPRC and implemented it in a system called Ellsberg that targets DPIs that assume fail-stop failures and the asynchronous (or partially synchronous) model. Our goal when designing Ellsberg was to make no assumptions about how DPIs are implemented, and to avoid additional coordination or communication. Therefore, Ellsberg builds on the observation that in the absence of Byzantine failures, a protocol safety properties are maintained if all live DPI processes correctly implement the protocol. Thus, Ellsberg checks RPRC by comparing messages sent and received by each DPI process to those produced by a simulated execution of the protocol. We apply Ellsberg to three open source DPIs, Etcd, Zookeeper and Redis Raft, and show that we can detect previously reported protocol bugs in these DPIs.
AB - Despite significant progress in verifying protocols, services that implement distributed protocols (we refer to these as DPIs in what follows), e.g., Chubby or Etcd, can exhibit safety bugs in production deployments. These bugs are often introduced by programmers when converting protocol descriptions into code. This paper introduces Runtime Protocol Refinement Checking (RPRC), a runtime approach for detecting protocol implementation bugs in DPIs. RPRC systems observe a deployed DPI's runtime behavior and notify operators when this behavior evidences a protocol implementation bug, allowing operators to mitigate the bugs impact and developers to fix the bug. We have developed an algorithm for RPRC and implemented it in a system called Ellsberg that targets DPIs that assume fail-stop failures and the asynchronous (or partially synchronous) model. Our goal when designing Ellsberg was to make no assumptions about how DPIs are implemented, and to avoid additional coordination or communication. Therefore, Ellsberg builds on the observation that in the absence of Byzantine failures, a protocol safety properties are maintained if all live DPI processes correctly implement the protocol. Thus, Ellsberg checks RPRC by comparing messages sent and received by each DPI process to those produced by a simulated execution of the protocol. We apply Ellsberg to three open source DPIs, Etcd, Zookeeper and Redis Raft, and show that we can detect previously reported protocol bugs in these DPIs.
UR - http://www.scopus.com/inward/record.url?scp=105006429504&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105006429504&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:105006429504
T3 - Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, NSDI 2025
SP - 1305
EP - 1326
BT - Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, NSDI 2025
PB - USENIX Association
T2 - 22nd USENIX Symposium on Networked Systems Design and Implementation, NSDI 2025
Y2 - 28 April 2025 through 30 April 2025
ER -