Abstract
Distributed stream processing system (DSPS) has proven to be an effective way to process and analyze large-scale data streams in real-time fashions. The reliability problem of DSPS is becoming a popular topic in recent years. Novel elastic DSPSs provide the ability to seamlessly adapt to stream workload changes, which introduce new reliability challenges: (1) operators can be scaled up and down at runtime, requiring fault tolerant methods to maintain data backup consistency under the runtime dynamics. (2) Rollback recovery to the last checkpoint may undo recent auto-scaling adjustments, which will introduce high cost and unacceptable impact to the system. In this paper, we put forward a novel fault-tolerant mechanism to deal with these issues. In particular, we propose a self-adaptive backup unit, elastic data slice (EDS), that can partition and merge data backups according to operator auto-scaling at runtime. The consistency of recovery is guaranteed by new upstream backup protocols, which restart the system from the status after auto-scaling instead of last checkpoint and avoid high recovery latency. Based on them, we implement a prototype system named SPATE. Evaluations on SPATE show that our mechanism supports auto-scaling changes with similar overhead compared to existing approaches, while achieving low recovery latency despite auto-scaling.
Similar content being viewed by others
References
Abadi, D.J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., Zdonik, S.: Aurora: a new model and architecture for data stream management. VLDB J. 12(2), 120–139 (2003)
Balazinska, M., Balakrishnan, H., Madden, S.R., Stonebraker, M.: Fault-tolerance in the borealis distributed stream processing system. ACM Trans. Database Syst. (TODS) 33(1), 3 (2008)
Brito, A., Martin, A., Knauth, T., Creutz, S., Becker, D., Weigert, S., Fetzer, C.: Scalable and low-latency data processing with stream mapreduce. In: Proceedings of the IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom), 2011 , IEEE, pp. 48–58 (2011)
Castro Fernandez, R., Migliavacca, M., Kalyvianaki, E., Pietzuch, P.: Integrating scale out and fault tolerance in stream processing using operator state management. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, pp. 725–736 (2013)
Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, M.J., Hellerstein, J.M., Hong, W., Krishnamurthy, S., Madden, S.R., Reiss, F., Shah, M.A.: Telegraphcq: continuous dataflow processing. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, ACM, pp. 668–668 (2003)
Chen, Q., Hsu, M., Malu, C.: Fault tolerant distributed stream processing based on backtracking. Int. J. Netw. Distrib. Comput. 1(4), 226–238 (2013)
Cherniack, M., Balakrishnan, H., Balazinska, M., Carney, D., Cetintemel, U., Xing, Y., Zdonik, S.B.: Scalable distributed stream processing. CIDR 3, 257–268 (2003)
de Assuncao, M.D., da Silva, A., Buyya, R.: Distributed data stream processing and edge computing: a survey on resource elasticity and future directions. J. Netw. Comput. Appl. 103, 1–17 (2018)
De Matteis, T., Mencagli, G.: Elastic scaling for distributed latency-sensitive data stream operators. In: Proccedings of the 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), IEEE, pp. 61–68 (2017)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Gedik, B., Schneider, S., Hirzel, M., Wu, K.L.: Elastic scaling for data stream processing. Parallel Distrib. Syst. IEEE Trans. 25(25), 1447–1463 (2014)
Gu, Y., Zhang, Z., Ye, F., Yang, H., Kim, M., Lei, H., Liu, Z.: An empirical study of high availability in stream processing systems. In: Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware, Springer-Verlag New York, Inc., p. 23 (2009)
Gulisano, V., Jimenez-Peris, R., Patino-Martinez, M., Soriente, C., Valduriez, P.: Streamcloud: an elastic and scalable data streaming system. Parallel Distrib. Syst. IEEE Trans. 23(12), 2351–2365 (2012)
He, B., Yang, M., Guo, Z., Chen, R., Su, B., Lin, W., Zhou, L.: Comet: batched stream processing for data intensive distributed computing. In: Proceedings of the 1st ACM Symposium on Cloud Computing, ACM, pp. 63–74 (2010)
He, F., Wei, P.: Research on comprehensive point of interest (poi) recommendation based on spark. Clust. Comput. (2018). https://doi.org/10.1007/s10586-018-2061-y
Heinze, T., Zia, M., Krahn, R., Jerzak, Z., Fetzer, C.: An adaptive replication scheme for elastic data stream processing systems. In: Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems, ACM, pp. 150–161 (2015)
Hunt, P., Konar, M., Junqueira, F.P., Reed, B.: Zookeeper: Wait-free coordination for internet-scale systems. In: USENIX Annual Technical Conference, Boston, MA, USA, vol. 8 (2010)
Hwang, J.H., Balazinska, M., Rasin, A., Çetintemel, U., Stonebraker, M., Zdonik, S.: High-availability algorithms for distributed stream processing. In: Proceedings of the 21st International Conference on Data Engineering. ICDE 2005, IEEE, pp. 779–790 (2005)
Imai, S., Patterson, S., Varela, C.A.: Uncertainty-aware elastic virtual machine scheduling for stream processing systems. In: 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), IEEE, pp. 62–71 (2018)
Javed, M.H., Lu, X., Panda, D.K.: Cutting the tail: designing high performance message brokers to reduce tail latencies in stream processing. In: 2018 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, pp. 223–233 (2018)
Koldehofe, B., Mayer, R., Ramachandran, U., Rothermel, K., Völz, M.: Rollback-recovery without checkpoints in distributed event processing systems. In: Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems, ACM, pp. 27–38 (2013)
Li, H., Wu, J., Jiang, Z., Li, X., Wei, X.: Minimum backups for stream processing with recovery latency guarantees. IEEE Trans. Reliab. PP(99), 1–12 (2017)
Li, M., Tan, J., Wang, Y., Zhang, L., Salapura, V.: Sparkbench: a spark benchmarking suite characterizing large-scale in-memory data analytics. Clust. Comput. 20(3), 2575–2589 (2017)
Liu Z., Huang H., He Q., Chiew K., Gao Y.: Rare category exploration on llnear time complexity. In: Renz M., Shahabi C., Zhou X., Cheema M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol. 9050, pp. 37–54. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18123-3_3
Lohrmann, B., Janacik, P., Kao, O.: Elastic stream processing with latency guarantees. In: IEEE International Conference on Distributed Computing Systems, pp. 399–410 (2015)
Lombardi, F., Aniello, L., Bonomi, S., Querzoni, L.: Elastic symbiotic scaling of operators and resources in stream processing systems. IEEE Trans. Parallel Distrib. Syst. 29(3), 572–585 (2018)
Martin, A., Fetzer, C., Brito, A.: Active replication at (almost) no cost. In: 2011 30th IEEE Symposium on Reliable Distributed Systems (SRDS), IEEE, pp. 21–30 (2011)
Marz, N.: Storm: distributed and fault-tolerant realtime computation (2013)
Mencagli, G., Torquati, M., Danelutto, M.: Elastic-ppq: a two-level autonomic system for spatial preference query processing over dynamic data streams. Future Gener. Comput. Syst. 79, 862–877 (2018)
Neumeyer, L., Robbins, B., Nair, A., Kesari, A.: S4: Distributed stream computing platform. In: 2010 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp. 170–177 (2010)
Qian, Z., He, Y., Su, C., Wu, Z., Zhu, H., Zhang, T., Zhou, L., Yu, Y., Zhang, Z.: Timestream: reliable stream computation in the cloud. In: Proceedings of the 8th ACM European Conference on Computer Systems, ACM, pp. 1–14 (2013)
Sîrbu, A., Babaoglu, O.: Towards operator-less data centers through data-driven, predictive, proactive autonomics. Clust. Comput. 19(2), 865–878 (2016)
Sumalatha, M., Ananthi, M.: Efficient data retrieval using adaptive clustered indexing for continuous queries over streaming data. Clust. Comput. (2017). https://doi.org/10.1007/s10586-017-1093-z
Wang, H., Peh, L.S., Koukoumidis, E., Tao, S., Chan, M.C.: Meteor shower: a reliable stream processing system for commodity data centers. In: 2012 IEEE 26th International on Parallel & Distributed Processing Symposium (IPDPS), IEEE, pp. 1180–1191 (2012)
Wei, X., Xiang, L., Hongliang, L., Cong, L., Yuan, Z.: Flexible online mapreduce model and topology protocols supporting large-scale stream data processing. J. Jilin Univ. (Eng. Technol. Edn.) 46(4), 1222–1231 (2016)
Wei, X., Li, L., Li, X., Wang, X., Gao, S., Li, H.: Pec: proactive elastic collaborative resourcescheduling in data stream processing. In: Proceedings of the IEEE Transactions on Parallel and Distributed Systems (2019)
Wu, Y., Tan, K.L.: Chronostream: elastic stateful stream computation in the cloud. In: IEEE International Conference on Data Engineering, pp. 723–734 (2015)
Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: a fault-tolerant model for scalable stream processing. Technical Report, DTIC Document (2012)
Zhang, Z., Gu, Y., Ye, F., Yang, H., Kim, M., Lei, H., Liu, Z.: A hybrid approach to high availability in stream processing systems. In: 2010 IEEE 30th International Conference on Distributed Computing Systems (ICDCS), IEEE, pp. 138–148 (2010)
Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) (Grant Nos. 61602205 and 61772228), the National Key Research and Development Program of China (Grant Nos. 2017YFC1502306, 2016YFB0201503 and 2016YFB0701101), the Major Special Research Project of Science and Technology Department of Jilin Province (20160203008GX), and the Jilin Scientific and Technological Development Program (20170520066JH).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Wei, X., Zhuang, Y., Li, H. et al. Reliable stream data processing for elastic distributed stream processing systems. Cluster Comput 23, 555–574 (2020). https://doi.org/10.1007/s10586-019-02939-9
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10586-019-02939-9