Integrating workload balancing and fault tolerance in distributed stream processing system

Fang, Junhua; Chao, Pingfu; Zhang, Rong; Zhou, Xiaofang

doi:10.1007/s11280-018-0656-0

Published: 07 January 2019

World Wide Web, Volume 22, pages 2471–2496 (2019)

Junhua Fang^1,4, Pingfu Chao^2, Rong Zhang^3 & Xiaofang Zhou^1,2

Abstract

A Distributed Stream Processing Engine (DSPE) is designed to process continuous streams with guaranteed low latency, so as to achieve real-time performance. To satisfy this requirement, availability and efficiency are the main concerns of a DSPE; they are achieved through proper design of the fault tolerance module and the workload balancing module, respectively. However, the inherent characteristics of data streams, including persistence, dynamics and unpredictability, pose great challenges to satisfying both properties. As far as we know, most state-of-the-art DSPE systems take either fault tolerance or workload balancing as their single optimization goal, which in turn incurs higher resource overhead or longer recovery time. In this paper, we combine the fault tolerance and workload balancing mechanisms in the DSPE to reduce overall resource consumption while keeping the system interactive, high-throughput, scalable and highly available. Based on our data-level replication strategy, our method handles both dynamic data skewness and node failure: when the distribution of the incoming stream fluctuates, we rebalance the workload by selectively inactivating data on high-load nodes and activating their replicas on low-load nodes, which minimizes the migration overhead within the stateful operator; when a fault occurs, the system activates the replicas of the affected data to ensure correctness while keeping the workload balanced. Extensive experiments with various join workloads on both benchmark and real data show that our approach outperforms the baseline systems.
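
The replica-switching idea described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' algorithm: the `Node` class, the greedy selection rule, the `tolerance` parameter and the per-key load numbers are all assumptions made for illustration. It only shows the general pattern, in which each key is active on one node and held as an inactive replica elsewhere, so that rebalancing and recovery both reduce to flipping which copy is active rather than shipping operator state.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical worker: keys it processes plus inactive replicas it holds."""
    name: str
    alive: bool = True
    active: dict = field(default_factory=dict)    # key -> load (assumed > 0)
    replicas: dict = field(default_factory=dict)  # key -> load, not processed here

    @property
    def load(self):
        return sum(self.active.values())


def switch(key, src, dst):
    """Inactivate `key` on src and activate its replica on dst."""
    w = src.active.pop(key)
    src.replicas[key] = w
    dst.replicas.pop(key)
    dst.active[key] = w


def rebalance(nodes, tolerance):
    """Greedy heuristic (an assumption): move keys off the hottest node
    until the load gap between alive nodes is within `tolerance`."""
    while True:
        alive = [n for n in nodes if n.alive]
        hot = max(alive, key=lambda n: n.load)
        if hot.load - min(n.load for n in alive) <= tolerance:
            return
        best = None  # (key, destination, destination load after the move)
        for key, w in hot.active.items():
            for n in alive:
                # Only accept moves that leave dst strictly below hot's load,
                # so every switch strictly reduces imbalance and the loop ends.
                if n is not hot and key in n.replicas and w > 0 and n.load + w < hot.load:
                    if best is None or n.load + w < best[2]:
                        best = (key, n, n.load + w)
        if best is None:
            return  # no replica placement can improve the balance
        switch(best[0], hot, best[1])


def recover(nodes, failed, tolerance):
    """On failure, activate a surviving replica of every affected key,
    then rebalance the remaining nodes."""
    failed.alive = False
    for key, w in list(failed.active.items()):
        holders = [n for n in nodes if n.alive and key in n.replicas]
        if not holders:
            raise RuntimeError(f"no surviving replica for {key!r}")
        dst = min(holders, key=lambda n: n.load)
        del failed.active[key]
        dst.replicas.pop(key)
        dst.active[key] = w
    rebalance(nodes, tolerance)


# Usage with made-up loads: node A starts out overloaded.
a = Node("A", active={"k1": 5, "k2": 4, "k3": 1}, replicas={"k5": 1})
b = Node("B", active={"k4": 2}, replicas={"k1": 5, "k3": 1})
c = Node("C", active={"k5": 1}, replicas={"k2": 4, "k4": 2})
rebalance([a, b, c], tolerance=2)
print([n.load for n in (a, b, c)])  # [5, 3, 5]
```

In this toy model the achievable balance depends entirely on where replicas were placed beforehand; a key with no replica on a lighter node cannot be moved, which is why the sketch simply stops when no improving switch exists.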

Acknowledgments

This work is partially supported by the National Science Foundation of China under grants No. 61802273, 61572194, 61772356, 61572335, and 61836007, the Postdoctoral Research Foundation of China (2017M621813), the Postdoctoral Science Foundation of Jiangsu Province (2018K029C), the Natural Science Fund for Colleges and Universities in Jiangsu Province (18KJB520044), and the Suzhou Science and Technology Development Program (SYG201803). This work is also supported by the Open Program of Neusoft Corporation (SKLSAOP1801) and Blockheaders Co. Ltd.

Author information

Authors and Affiliations

Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, Suzhou, 215006, China
Junhua Fang & Xiaofang Zhou
School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD, 4072, Australia
Pingfu Chao & Xiaofang Zhou
School of Data Science and Engineering, East China Normal University, Shanghai, 200062, China
Rong Zhang
Neusoft Corporation, Shenyang, 110179, China
Junhua Fang

Corresponding author

Correspondence to Junhua Fang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fang, J., Chao, P., Zhang, R. et al. Integrating workload balancing and fault tolerance in distributed stream processing system. World Wide Web 22, 2471–2496 (2019). https://doi.org/10.1007/s11280-018-0656-0

Received: 05 April 2018
Revised: 06 December 2018
Accepted: 17 December 2018
Published: 07 January 2019
Issue Date: November 2019
DOI: https://doi.org/10.1007/s11280-018-0656-0
