Fault Tolerance and High Availability in Data Stream Management Systems

Definition

Like any other software system, a data stream management system (DSMS) can experience failures in any of its components. Failures are especially common in distributed DSMSs, where query operators are spread across multiple processing nodes, i.e., independent processes typically running on different physical machines in a local area network (LAN) or a wide area network (WAN). Failures of processing nodes or of the underlying communication network can cause continuous queries (CQs) in a DSMS to stall or produce erroneous results. These failures can adversely affect critical client applications that rely on these queries.

Traditionally, availability has been defined as the fraction of time that a system remains operational and properly services requests. In DSMSs, however, availability often also incorporates end-to-end latency, because applications must react quickly to real-time events and can therefore tolerate only small delays. A DSMS can handle failures using a...
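The two notions of availability described above can be contrasted with a minimal Python sketch. The entry does not give a formula for the latency-aware variant, so the metric below (the fraction of results delivered within a delay bound) is only one possible formalization. The names `Result`, `classic_availability`, `latency_aware_availability`, and `max_delay_s` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Result:
    """One query result (hypothetical record type for this sketch)."""
    event_time: float   # when the input event occurred, in seconds
    output_time: float  # when the result reached the client, in seconds


def classic_availability(uptime_s: float, total_s: float) -> float:
    """Fraction of the observation period during which the system was operational."""
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    return uptime_s / total_s


def latency_aware_availability(results: Sequence[Result], max_delay_s: float) -> float:
    """Fraction of results whose end-to-end delay stays within max_delay_s.

    This is an assumed metric, not a definition taken from the entry.
    """
    if not results:
        return 0.0
    on_time = sum(1 for r in results if r.output_time - r.event_time <= max_delay_s)
    return on_time / len(results)


if __name__ == "__main__":
    observed = [Result(0.0, 0.5), Result(10.0, 11.0), Result(20.0, 23.0)]
    print(classic_availability(uptime_s=99.0, total_s=100.0))
    print(latency_aware_availability(observed, max_delay_s=2.0))
```

Under this sketch, a system that never goes down can still score poorly on the second metric if a failure or recovery delays its results beyond the bound the application tolerates.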

Author information

Corresponding author

Correspondence to Magdalena Balazinska.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

Cite this entry

Balazinska, M., Hwang, JH., Shah, M.A. (2018). Fault Tolerance and High Availability in Data Stream Management Systems. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_160
