TY - GEN
T1 - Drizzle: Fast and Adaptable Stream Processing at Scale
T2 - 26th ACM Symposium on Operating Systems Principles, SOSP 2017
AU - Venkataraman, Shivaram
AU - Armbrust, Michael
AU - Panda, Aurojit
AU - Ghodsi, Ali
AU - Ousterhout, Kay
AU - Franklin, Michael J.
AU - Recht, Benjamin
AU - Stoica, Ion
N1 - Funding Information:
We would like to thank the anonymous reviewers and our shepherd, Marco Serafini, for their insightful comments that improved this paper. We would also like to thank Ganesh Ananthanarayanan, Evan Sparks and Matei Zaharia for their feedback on earlier drafts of this paper. We also thank Jibin Zhan and Yan Li for discussions on video analytics applications. This research is supported in part by DHS Award HSHQDC-16-3-00083, NSF CISE Expeditions Award CCF-1139158, and gifts from Ant Financial, Amazon Web Services, CapitalOne, Ericsson, GE, Google, Huawei, Intel, IBM, Microsoft and VMware. BR is generously supported by ONR award N00014-17-1-2401, NSF award CCF-1359814, ONR awards N00014-14-1-0024 and N00014-17-1-2191, a Sloan Research Fellowship, a Google Faculty Award, and research grants from Amazon.
Publisher Copyright:
© 2017 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
PY - 2017/10/14
Y1 - 2017/10/14
AB - Large scale streaming systems aim to provide high throughput and low latency. They are often used to run mission-critical applications, and must be available 24x7. Thus such systems need to adapt to failures and inherent changes in workloads, with minimal impact on latency and throughput. Unfortunately, existing solutions require operators to choose between achieving low latency during normal operation and incurring minimal impact during adaptation. Continuous operator streaming systems, such as Naiad and Flink, provide low latency during normal execution but incur high overheads during adaptation (e.g., recovery), while micro-batch systems, such as Spark Streaming and FlumeJava, adapt rapidly at the cost of high latency during normal operations. Our key observation is that while streaming workloads require millisecond-level processing, workload and cluster properties change less frequently. Based on this, we develop Drizzle, a system that decouples the processing interval from the coordination interval used for fault tolerance and adaptability. Our experiments on a 128 node EC2 cluster show that on the Yahoo Streaming Benchmark, Drizzle can achieve end-to-end record processing latencies of less than 100ms and can get 2–3x lower latency than Spark. Drizzle also exhibits better adaptability, and can recover from failures 4x faster than Flink while having up to 13x lower latency during recovery.
KW - Performance
KW - Reliability
KW - Stream Processing
UR - http://www.scopus.com/inward/record.url?scp=85041649686&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85041649686&partnerID=8YFLogxK
U2 - 10.1145/3132747.3132750
DO - 10.1145/3132747.3132750
M3 - Conference contribution
AN - SCOPUS:85041649686
T3 - SOSP 2017 - Proceedings of the 26th ACM Symposium on Operating Systems Principles
SP - 374
EP - 389
BT - SOSP 2017 - Proceedings of the 26th ACM Symposium on Operating Systems Principles
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2017 through 31 October 2017
ER -