Cost-effective and adaptive clustering algorithm for stream processing on cloud system

GeoInformatica

Abstract

Clustering is a fundamental operation that plays an essential role in data management and analysis. Clustering algorithms have been well studied over the past two decades, but real-time clustering has yet to mature in practice. For applications built on clustering, capturing the dynamic changes of clusters and the trends of moving objects in real time can maximize the value of the data. Although a DSPE (Distributed Stream Processing Engine) is capable of such workloads, it still suffers from fixed window sizes and wasted computational resources. In this paper, we introduce a new Cost-effective and Adaptive Clustering method (CeAC), which improves computational efficiency while preserving the accuracy of the clustering result. Specifically, we design a composite window model that contains the latest data records and maintains historical states. To achieve lightweight clustering, we propose a fully online clustering algorithm based on grid density, which can capture clusters of arbitrary shape and effectively handle outliers in parallel. We further introduce an adaptive calculation model that accelerates the clustering operation by shedding workload according to the characteristics of the incoming data. Experimental results show that the proposed method is accurate and efficient in real-time data stream clustering.
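The general idea of grid-density clustering mentioned in the abstract can be illustrated with a minimal sketch. This is not the CeAC algorithm itself: it is a generic, single-machine, batch version, and the function name `grid_density_clusters` and the parameters `cell_size` and `min_density` are assumptions introduced here for illustration. Points are mapped to grid cells, cells with enough points are treated as dense, adjacent dense cells are merged into one cluster, and the remaining sparse cells are reported as outliers.

```python
from collections import defaultdict, deque
from itertools import product


def grid_density_clusters(points, cell_size, min_density):
    """Cluster 2-D points by grid density (illustrative sketch only).

    Returns (labels, outliers): labels maps each dense cell to a cluster id,
    outliers is the set of occupied cells that are too sparse to be dense.
    """
    # Count how many points fall into each grid cell.
    density = defaultdict(int)
    for x, y in points:
        density[(int(x // cell_size), int(y // cell_size))] += 1

    # A cell is dense if it holds at least min_density points (assumed rule).
    dense = {cell for cell, count in density.items() if count >= min_density}

    # Merge dense cells that touch (8-neighbourhood) with a breadth-first search.
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx, dy in product((-1, 0, 1), repeat=2):
                neighbour = (cx + dx, cy + dy)
                if neighbour in dense and neighbour not in labels:
                    labels[neighbour] = next_label
                    queue.append(neighbour)
        next_label += 1

    # Occupied cells that never became dense are treated as outlier cells.
    outliers = set(density) - dense
    return labels, outliers
```

Because clusters are built from connected dense cells rather than from distances to a centroid, the resulting shapes can be arbitrary, which is the property the abstract highlights. A streaming variant would additionally need to update and age cell counts as records enter and leave the window; that part is left out of this sketch.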

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61802273), the Postdoctoral Science Foundation of China (No. 2020M681529), and the Natural Science Foundation for Colleges and Universities in Jiangsu Province (No. 18KJB520044).

Author information

Corresponding author

Correspondence to Junhua Fang.

Cite this article

Xia, Y., Fang, J., Chao, P. et al. Cost-effective and adaptive clustering algorithm for stream processing on cloud system. Geoinformatica 27, 1–21 (2023). https://doi.org/10.1007/s10707-021-00442-1

  • DOI: https://doi.org/10.1007/s10707-021-00442-1
