Integrating workload balancing and fault tolerance in distributed stream processing system


A Distributed Stream Processing Engine (DSPE) is designed to process continuous streams in real time with guaranteed low latency. To meet this requirement, availability and efficiency are the main concerns of a DSPE; they are achieved through a well-designed fault tolerance module and workload balancing module, respectively. However, the inherent characteristics of data streams, including persistence, dynamism, and unpredictability, make it difficult to satisfy both properties at once. As far as we know, most state-of-the-art DSPEs take either fault tolerance or workload balancing as their single optimization goal, which in turn incurs higher resource overhead or longer recovery time. In this paper, we combine the fault tolerance and workload balancing mechanisms in the DSPE to reduce overall resource consumption while keeping the system interactive, high-throughput, scalable, and highly available. Based on our data-level replication strategy, our method handles both dynamic data skewness and node failures. When the distribution of the incoming stream fluctuates, we rebalance the workload by selectively deactivating data on high-load nodes and activating their replicas on low-load nodes, which minimizes migration overhead within the stateful operator. When a fault occurs, the system activates the replicas of the affected data to ensure correctness while keeping the workload balanced. Extensive experiments with various join workloads on both benchmark and real data show that our method outperforms baseline systems.
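
The replica-switching idea described above can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not specify; it is a toy greedy heuristic under stated assumptions: every key is stored on a fixed set of nodes, exactly one copy is active, and per-key load is known. The class `ReplicatedOperator`, its methods, and the node and key names are all hypothetical.

```python
from collections import defaultdict


class ReplicatedOperator:
    """Toy stateful operator: each key lives on several nodes, one copy is active."""

    def __init__(self, replicas, load):
        # replicas: key -> nodes holding a copy (the first one starts active)
        # load: key -> current processing cost of that key (hypothetical units)
        self.replicas = {k: list(nodes) for k, nodes in replicas.items()}
        self.load = dict(load)
        self.active = {k: nodes[0] for k, nodes in self.replicas.items()}

    def node_loads(self):
        """Load per node, counting only active copies; idle replica holders get 0."""
        loads = defaultdict(float)
        for nodes in self.replicas.values():
            for node in nodes:
                loads[node] += 0.0
        for key, node in self.active.items():
            loads[node] += self.load[key]
        return dict(loads)

    def rebalance(self):
        """Greedily flip active copies away from the hottest node.

        A flip only changes which replica is active, so no state is migrated.
        """
        moves = []
        while True:
            loads = self.node_loads()
            hot = max(loads, key=loads.get)
            best = None
            for key, node in self.active.items():
                if node != hot:
                    continue
                for target in self.replicas[key]:
                    if target == hot:
                        continue
                    cost = self.load[key]
                    peak = max(loads[hot] - cost, loads[target] + cost)
                    # Accept only flips that strictly lower the peak of the pair.
                    if peak < loads[hot] and (best is None or peak < best[0]):
                        best = (peak, key, target)
            if best is None:
                return moves
            _, key, target = best
            self.active[key] = target
            moves.append((key, hot, target))

    def fail(self, failed):
        """Drop a failed node and activate a surviving replica for each affected key."""
        for nodes in self.replicas.values():
            if failed in nodes:
                nodes.remove(failed)
        affected = [k for k, n in self.active.items() if n == failed]
        # Place the heaviest keys first so lighter ones can fill the gaps.
        for key in sorted(affected, key=self.load.get, reverse=True):
            survivors = self.replicas[key]
            if not survivors:
                raise RuntimeError(f"no surviving replica for key {key!r}")
            loads = self.node_loads()
            self.active[key] = min(survivors, key=loads.get)


op = ReplicatedOperator(
    replicas={"a": ["n1", "n2"], "b": ["n1", "n3"], "c": ["n1", "n2"], "d": ["n2", "n3"]},
    load={"a": 5, "b": 4, "c": 3, "d": 1},
)
op.rebalance()  # skewed start: n1 holds the active copies of a, b, and c
op.fail("n3")   # keys active on n3 fall back to their surviving replicas
print(op.node_loads())
```

In this sketch both rebalancing and recovery reduce to changing which copy is active, which is why a single replication scheme can serve both goals. A real system would also have to keep replicas consistent with the active copy, a cost this toy model ignores.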






This work is partially supported by the National Science Foundation of China under grants No. 61802273, 61572194, 61772356, 61572335, and 61836007, the Postdoctoral Research Foundation of China (2017M621813), the Postdoctoral Science Foundation of Jiangsu Province (2018K029C), the Natural Science Fund for Colleges and Universities in Jiangsu Province (18KJB520044), and the Suzhou Science and Technology Development Program (SYG201803). This work is also supported by the Open Program of Neusoft Corporation (SKLSAOP1801) and Blockheaders Co. Ltd.

Author information



Corresponding author

Correspondence to Junhua Fang.


Cite this article

Fang, J., Chao, P., Zhang, R. et al. Integrating workload balancing and fault tolerance in distributed stream processing system. World Wide Web 22, 2471–2496 (2019).



  • Real-time data processing
  • Distributed systems
  • Load Balancing
  • High availability computing