Abstract
Low-latency, high-throughput processing is a critical requirement for big data stream applications. However, interference among stream processing tasks in a data center lowers the utilization of computational resources and prolongs task latency. We therefore study an optimal scheduling method for processing a big data stream on heterogeneous multicore servers in a data center. We model big data stream processing and the scheduling problem in terms of four kinds of objects: streaming data items, processing tasks, computational nodes, and the cores inside each computational node. We present an interference model based on regression analysis and a prediction model based on the Autoregressive Integrated Moving Average (ARIMA) model. We then propose a two-stage scheduling method that combines fine-grained core scheduling with coarse-grained node scheduling. In the core scheduling stage, we design a core scheduling algorithm named CS_TDF. In the node scheduling stage, we design a node scheduling algorithm named NS_ITF for a single time window and a continuous scheduling algorithm named PS_UIM for the entire data stream across all time windows. The experimental results show that our scheduling method achieves low interference and high computational resource utilization.
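To make the prediction step concrete, the following is a minimal sketch of a one-step forecast in the ARIMA family, assuming the simplest non-trivial order, ARIMA(1,1,0) with no intercept. The abstract does not state the model order or the quantity being forecast, so the function names `fit_ar1` and `forecast_next`, the choice of per-window arrival rate as the input, and the sample values are all illustrative assumptions and not the authors' implementation.

```python
def fit_ar1(series):
    """Least-squares AR(1) coefficient (no intercept) for a sequence.

    Hypothetical helper: minimizes sum((x[t] - phi * x[t-1])**2).
    """
    num = sum(a * b for a, b in zip(series[:-1], series[1:]))
    den = sum(a * a for a in series[:-1])
    # A zero denominator means no usable signal; fall back to phi = 0.
    return num / den if den else 0.0


def forecast_next(rates):
    """One-step ARIMA(1,1,0)-style forecast of the next window's rate.

    Steps: difference the series once (the "I" part), fit AR(1) on the
    differences (the "AR" part), then add the predicted difference back
    to the last observation.
    """
    if len(rates) < 3:
        # Too few points to fit anything; repeat the last observation.
        return rates[-1]
    diffs = [b - a for a, b in zip(rates[:-1], rates[1:])]
    phi = fit_ar1(diffs)
    return rates[-1] + phi * diffs[-1]


if __name__ == "__main__":
    # Assumed example: items per time window growing by 10 each window.
    print(forecast_next([100, 110, 120, 130, 140]))
```

In a scheduler of the kind the abstract describes, a forecast like this could plausibly be used to size resources for the next time window before its data arrive; a production system would more likely rely on a full ARIMA implementation with order selection.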
Data availability
Enquiries about data availability should be directed to the authors.
Funding
This work was supported by the National Key R&D Program of China (Grant No. 2019YFB1704100), the National Natural Science Foundation of China (Grant No. 62072337), the National Social Science Foundation of China (Grant No. 17BTQ086), and a Subproject of the National Seafloor Observatory System of China (Grant No. 2970000001/001/016).
Author information
Contributions
SW proposed the idea of this paper, collected the background material, and designed and conducted the experiments. SW was also responsible for checking the experimental results. GZ offered constructive suggestions for the research. The paper was written by SW and checked by GZ.
Ethics declarations
Competing interests
The authors have not disclosed any competing interests.
About this article
Cite this article
Wang, S., Zeng, Gs. Two-stage scheduling for a fluctuant big data stream on heterogeneous servers with multicores in a data center. Cluster Comput 27, 1581–1597 (2024). https://doi.org/10.1007/s10586-023-04044-4
DOI: https://doi.org/10.1007/s10586-023-04044-4