Split incremental clustering algorithm of mixed data stream

  • Regular Paper
  • Published:
Progress in Artificial Intelligence

Abstract

Clustering is recognized as one of the most prominent tasks in data mining. It aims to partition a given set of elements into homogeneous groups according to some (dis)similarity criterion, without prior knowledge of the data distribution. In this paper, we propose a novel streaming algorithm based on a split technique, introduced to avoid retraining from scratch and to keep the clustering incremental. It is designed to cluster continuously arriving chunks of data that come with new mixed features, under memory and time restrictions. The proposed real-time method uses the split technique to handle the incremental object, attribute, and class learning spaces at once, updating the final distribution of the clusters when necessary. Thanks to the split technique, updating the final cluster distribution leads to a promising clustering model. Experiments on real mixed data sets show that the proposal is efficient and outperforms the conventional k-prototypes method according to several evaluation measures.
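For readers unfamiliar with the baseline, the following is a minimal sketch of the mixed dissimilarity used by conventional k-prototypes [19]: squared Euclidean distance on the numeric attributes plus a weight gamma times the number of categorical mismatches. It illustrates only the baseline measure and a nearest-prototype assignment of one incoming chunk; it is not the split algorithm proposed in the paper. The names `Record`, `mixed_dissimilarity`, and `assign_chunk`, the sample records, and the value of `gamma` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# A record is split into its numeric part and its categorical part.
# This layout is an assumption made for the example.
Record = Tuple[Sequence[float], Sequence[str]]


def mixed_dissimilarity(x: Record, prototype: Record, gamma: float = 1.0) -> float:
    """Squared Euclidean distance on numeric attributes plus
    gamma times the number of categorical mismatches."""
    x_num, x_cat = x
    p_num, p_cat = prototype
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    categorical = sum(1 for a, b in zip(x_cat, p_cat) if a != b)
    return numeric + gamma * categorical


def assign_chunk(chunk: Sequence[Record],
                 prototypes: Sequence[Record],
                 gamma: float = 1.0) -> List[int]:
    """Return, for each record in the chunk, the index of its nearest prototype."""
    return [
        min(range(len(prototypes)),
            key=lambda k: mixed_dissimilarity(x, prototypes[k], gamma))
        for x in chunk
    ]


# Illustrative data only: two prototypes and one small incoming chunk.
prototypes = [((0.0, 0.0), ("red", "suv")), ((5.0, 5.0), ("blue", "sedan"))]
chunk = [((0.5, 0.2), ("red", "sedan")), ((4.8, 5.1), ("blue", "sedan"))]
labels = assign_chunk(chunk, prototypes, gamma=0.5)
print(labels)  # [0, 1]
```

The weight gamma balances the two attribute types: a larger value lets categorical mismatches dominate, a smaller one favors the numeric distance.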

Data availability

The data used in the experimentation section are open source and derived from the UCI repository [26], OpenML [27], and a Kaggle data set (https://www.kaggle.com/austinreese/craigslist-carstrucks-data).

Notes

  1. https://www.kaggle.com/austinreese/craigslist-carstrucks-data.

References

  1. Anderlucci, L., Fortunato, F., Montanari, A.: High-dimensional clustering via random projections. J. Classif. 1–26 (2021)

  2. Bhagat, A., Kshirsagar, N., Khodke, P., Dongre, K., Ali, S.: Penalty parameter selection for hierarchical data stream clustering. Proc. Comput. Sci. 79, 24–31 (2016)

  3. Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., Carvalho, A.C.D., Gama, J.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 1–31 (2013)

  4. Chefrour, A.: Incremental supervised learning: algorithms and applications in pattern recognition. Evol. Intel. 12(2), 97–112 (2019)

  5. Sowjanya, A.M., Shashi, M.: Cluster feature-based incremental clustering approach (CFICA) for numerical data. Int. J. Comput. Sci. Netw. Sec. 10(9), 73–79 (2010)

  6. Lamirel, J.C., Mall, R., Ahmad, M.: Comportement comparatif des méthodes de clustering incrémentales et non incrémentales sur les données textuelles hétérogènes. In: 11th International Francophone Conference on Knowledge Extraction and Management (EGC 2011) (2011)

  7. Sowjanya, A.M., Shashi, M.: A cluster feature-based incremental clustering approach to mixed data. J. Comput. Sci. 7(12), 1875 (2011)

  8. Noorbehbahani, F., Mousavi, S.R., Mirzaei, A.: An incremental mixed data clustering method using a new distance measure. Soft. Comput. 19(3), 731–743 (2015)

  9. Shen, F., Hasegawa, O.: A fast nearest neighbor classifier based on self-organizing incremental neural network. Neural Netw. 21(10), 1537–1547 (2008)

  10. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: Proceedings 2003 VLDB Conference. Morgan Kaufmann, pp. 81–92 (2003)

  11. Ghesmoune, M., Lebbah, M., Azzag, H.: State-of-the-art on clustering data streams. Big Data Anal. 1(1), 1–27 (2016)

  12. Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: a clustering algorithm for data streams. J. Exp. Algorithmics 17, 2–1 (2012)

  13. Amini, A., Wah, T.Y., Saboohi, H.: On density-based data streams clustering algorithms: a survey. J. Comput. Sci. Technol. 29(1), 116–141 (2014)

  14. Ounali, C., Ben Rejab, F., Nouira Ferchichi, K.: Incremental algorithm based on split technique. In: International Conference on Intelligent Systems Design and Applications. Springer, Cham, pp. 567–576 (2018)

  15. Bao, J., Wang, W., Yang, T., Wu, G.: An incremental clustering method based on the boundary profile. PLoS ONE 13(4), e0196108 (2018)

  16. Savaresi, S.M., Boley, D.L., Bittanti, S., Gazzaniga, G.: Cluster selection in divisive clustering algorithms. In: Proceedings of the 2002 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, pp. 299–314 (2002)

  17. Marszałek, Z.: Performance tests on merge sort and recursive merge sort for big data processing. Technical Sciences/University of Warmia and Mazury in Olsztyn (2018)

  18. Gorrab, S., Rejab, F.B.: IK-prototypes: incremental mixed attribute learning based on K-prototypes algorithm, a new method. In: International Conference on Intelligent Systems Design and Applications. Springer, Cham, pp. 880–890 (2020)

  19. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Disc. 2(3), 283–304 (1998)

  20. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, pp. 281–297 (1967)

  21. Huang, Z.: A fast clustering algorithm to cluster very large categorical data sets in data mining. Dmkd 3(8), 34–39 (1997)

  22. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 2, 224–227 (1979)

  23. Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat. Theory Methods 3(1), 1–27 (1974)

  24. Xu, R., Wunsch, D.C.: Clustering algorithms in biomedical research: a review. IEEE Rev. Biomed. Eng. 3, 120–154 (2010)

  25. Clarke, K.R., Chapman, M.G., Somerfield, P.J., Needham, H.R.: Dispersion-based weighting of species counts in assemblage analyses. Mar. Ecol. Prog. Ser. 320, 11–27 (2006)

  26. Asuncion, A., Newman, D.: UCI machine learning repository (2007)

  27. Vanschoren, J., Van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. ACM SIGKDD Explor. Newsl. 15(2), 49–60 (2014)

  28. Guo, S., Dong, X.L., Srivastava, D., Zajac, R.: Record linkage with uniqueness constraints and erroneous values. Proc. VLDB Endow. 3(1–2), 417–428 (2010)

Author information

Corresponding author

Correspondence to Siwar Gorrab.

Ethics declarations

Conflict of interest

The authors have no conflict of interest to declare.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study. This manuscript is the authors’ original work and has not been submitted simultaneously elsewhere. All authors have checked the manuscript and agreed to the submission.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gorrab, S., Ben Rejab, F. & Nouira, K. Split incremental clustering algorithm of mixed data stream. Prog Artif Intell 13, 51–64 (2024). https://doi.org/10.1007/s13748-024-00316-1

  • DOI: https://doi.org/10.1007/s13748-024-00316-1
