Runtime Protocol Refinement Checking for Distributed Protocol Implementations

Ding Ding, Zhanghan Wang, Jinyang Li, Aurojit Panda

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Despite significant progress in verifying protocols, services that implement distributed protocols (we refer to these as DPIs in what follows), e.g., Chubby or Etcd, can exhibit safety bugs in production deployments. These bugs are often introduced by programmers when converting protocol descriptions into code. This paper introduces Runtime Protocol Refinement Checking (RPRC), a runtime approach for detecting protocol implementation bugs in DPIs. RPRC systems observe a deployed DPI's runtime behavior and notify operators when this behavior evidences a protocol implementation bug, allowing operators to mitigate the bugs impact and developers to fix the bug. We have developed an algorithm for RPRC and implemented it in a system called Ellsberg that targets DPIs that assume fail-stop failures and the asynchronous (or partially synchronous) model. Our goal when designing Ellsberg was to make no assumptions about how DPIs are implemented, and to avoid additional coordination or communication. Therefore, Ellsberg builds on the observation that in the absence of Byzantine failures, a protocol safety properties are maintained if all live DPI processes correctly implement the protocol. Thus, Ellsberg checks RPRC by comparing messages sent and received by each DPI process to those produced by a simulated execution of the protocol. We apply Ellsberg to three open source DPIs, Etcd, Zookeeper and Redis Raft, and show that we can detect previously reported protocol bugs in these DPIs.

Original languageEnglish (US)
Title of host publicationProceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, NSDI 2025
PublisherUSENIX Association
Pages1305-1326
Number of pages22
ISBN (Electronic)9781939133465
StatePublished - 2025
Event22nd USENIX Symposium on Networked Systems Design and Implementation, NSDI 2025 - Philadelphia, United States
Duration: Apr 28 2025Apr 30 2025

Publication series

NameProceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, NSDI 2025

Conference

Conference22nd USENIX Symposium on Networked Systems Design and Implementation, NSDI 2025
Country/TerritoryUnited States
CityPhiladelphia
Period4/28/254/30/25

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Control and Systems Engineering

Fingerprint

Dive into the research topics of 'Runtime Protocol Refinement Checking for Distributed Protocol Implementations'. Together they form a unique fingerprint.

Cite this