Split incremental clustering algorithm of mixed data stream

Gorrab, Siwar; Ben Rejab, Fahmi; Nouira, Kaouther

doi:10.1007/s13748-024-00316-1

Regular Paper
Published: 07 March 2024

Volume 13, pages 51–64, (2024)
Progress in Artificial Intelligence

Abstract

Clustering is one of the most prominent tasks in data mining. It aims to partition a given set of elements into homogeneous groups according to some (dis)similarity criterion, without prior knowledge of the data distribution. In this paper, we propose a novel streaming algorithm based on a split technique, introduced to avoid retraining from scratch and to keep the clustering incremental. It clusters continuously arriving chunks of data, which may come with new mixed (numerical and categorical) features, under memory and time restrictions. The proposed real-time method uses the split technique to handle the incremental object, attribute, and class learning spaces at once, updating the final distribution of the clusters when necessary. Updating the final cluster distribution through splitting yields a promising clustering model. Experiments on real mixed data sets show that the proposal is efficient and outperforms the conventional k-prototypes method according to different evaluation measures.
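The general idea of absorbing a new chunk and then splitting a cluster, instead of re-clustering everything, can be illustrated with a minimal Python sketch. It is not the authors' algorithm, whose details are not given in this abstract. It assumes the standard k-prototypes dissimilarity (squared Euclidean distance on numerical attributes plus a weight `gamma` times the number of categorical mismatches) and a hypothetical split rule that divides the cluster with the highest within-cluster cost around two distant members. All function names are illustrative, and the arrival of new attributes is not handled.

```python
from collections import Counter


def mixed_distance(x_num, x_cat, p_num, p_cat, gamma=1.0):
    """k-prototypes dissimilarity: squared Euclidean + gamma * mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    categorical = sum(a != b for a, b in zip(x_cat, p_cat))
    return numeric + gamma * categorical


def prototype(members):
    """Mean of the numerical parts and per-attribute mode of the categorical parts."""
    nums = [m[0] for m in members]
    cats = [m[1] for m in members]
    mean = tuple(sum(col) / len(col) for col in zip(*nums))
    mode = tuple(Counter(col).most_common(1)[0][0] for col in zip(*cats))
    return mean, mode


def cost(members, gamma=1.0):
    """Sum of dissimilarities between the members and their prototype."""
    p_num, p_cat = prototype(members)
    return sum(mixed_distance(n, c, p_num, p_cat, gamma) for n, c in members)


def absorb_chunk(clusters, chunk, gamma=1.0):
    """Assign each new object to the nearest existing prototype (no retraining)."""
    protos = [prototype(c) for c in clusters]
    for num, cat in chunk:
        best = min(
            range(len(protos)),
            key=lambda i: mixed_distance(num, cat, *protos[i], gamma),
        )
        clusters[best].append((num, cat))
    return clusters


def split_worst(clusters, gamma=1.0):
    """Hypothetical split rule: divide the highest-cost cluster in two."""
    worst = max(range(len(clusters)), key=lambda i: cost(clusters[i], gamma))
    members = clusters[worst]
    if len(members) < 2:
        return clusters
    p_num, p_cat = prototype(members)
    # Seed A: member farthest from the prototype; seed B: member farthest from A.
    seed_a = max(members, key=lambda m: mixed_distance(*m, p_num, p_cat, gamma))
    seed_b = max(members, key=lambda m: mixed_distance(*m, *seed_a, gamma))
    left, right = [], []
    for m in members:
        d_a = mixed_distance(*m, *seed_a, gamma)
        d_b = mixed_distance(*m, *seed_b, gamma)
        (left if d_a <= d_b else right).append(m)
    if not right:  # all members identical: nothing to split
        return clusters
    return clusters[:worst] + [left, right] + clusters[worst + 1:]


# Each object is a pair (numerical tuple, categorical tuple).
clusters = [
    [((0.0,), ("a",)), ((1.0,), ("a",))],
    [((10.0,), ("b",)), ((11.0,), ("b",))],
]
chunk = [((0.5,), ("a",)), ((20.0,), ("c",)), ((21.0,), ("c",))]

clusters = absorb_chunk(clusters, chunk)
clusters = split_worst(clusters)
print(len(clusters), "clusters after absorbing the chunk and splitting")
```

In this toy run the two objects with category "c" first join the nearest existing cluster and are then separated by the split, so a new group emerges without re-clustering the earlier objects.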

Data availability

The data used in the experiments are open source and come from the UCI repository [26], OpenML [27], and the Kaggle data set at https://www.kaggle.com/austinreese/craigslist-carstrucks-data.

References

[1] Anderlucci, L., Fortunato, F., Montanari, A.: High-dimensional clustering via random projections. J. Classif. 1–26 (2021)
[2] Bhagat, A., Kshirsagar, N., Khodke, P., Dongre, K., Ali, S.: Penalty parameter selection for hierarchical data stream clustering. Proc. Comput. Sci. 79, 24–31 (2016)
[3] Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., Carvalho, A.C.D., Gama, J.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 1–31 (2013)
[4] Chefrour, A.: Incremental supervised learning: algorithms and applications in pattern recognition. Evol. Intel. 12(2), 97–112 (2019)
[5] Sowjanya, A.M., Shashi, M.: Cluster feature-based incremental clustering approach (CFICA) for numerical data. Int. J. Comput. Sci. Netw. Sec. 10(9), 73–79 (2010)
[6] Lamirel, J.C., Mall, R., Ahmad, M.: Comportement comparatif des méthodes de clustering incrémentales et non incrémentales sur les données textuelles hétérogènes. In: 11th International Francophone Conference on Knowledge Extraction and Management (EGC 2011) (2011)
[7] Sowjanya, A.M., Shashi, M.: A cluster feature-based incremental clustering approach to mixed data. J. Comput. Sci. 7(12), 1875 (2011)
[8] Noorbehbahani, F., Mousavi, S.R., Mirzaei, A.: An incremental mixed data clustering method using a new distance measure. Soft. Comput. 19(3), 731–743 (2015)
[9] Shen, F., Hasegawa, O.: A fast nearest neighbor classifier based on self-organizing incremental neural network. Neural Netw. 21(10), 1537–1547 (2008)
[10] Aggarwal, C.C., Philip, S.Y., Han, J., Wang, J.: A framework for clustering evolving data streams. In: Proceedings 2003 VLDB Conference. Morgan Kaufmann, pp. 81–92 (2003)
[11] Ghesmoune, M., Lebbah, M., Azzag, H.: State-of-the-art on clustering data streams. Big Data Anal. 1(1), 1–27 (2016)
[12] Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: StreamKM++: a clustering algorithm for data streams. J. Exp. Algorithmics 17, 2–1 (2012)
[13] Amini, A., Wah, T.Y., Saboohi, H.: On density-based data streams clustering algorithms: a survey. J. Comput. Sci. Technol. 29(1), 116–141 (2014)
[14] Ounali, C., Ben Rejab, F., Nouira Ferchichi, K.: Incremental algorithm based on split technique. In: International Conference on Intelligent Systems Design and Applications. Springer, Cham, pp. 567–576 (2018)
[15] Bao, J., Wang, W., Yang, T., Wu, G.: An incremental clustering method based on the boundary profile. PLoS ONE 13(4), e0196108 (2018)
[16] Savaresi, S.M., Boley, D.L., Bittanti, S., Gazzaniga, G.: Cluster selection in divisive clustering algorithms. In: Proceedings of the 2002 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, pp. 299–314 (2002)
[17] Marszałek, Z.: Performance tests on merge sort and recursive merge sort for big data processing. Technical Sciences/University of Warmia and Mazury in Olsztyn (2018)
[18] Gorrab, S., Rejab, F.B.: IK-prototypes: incremental mixed attribute learning based on K-prototypes algorithm, a new method. In: International Conference on Intelligent Systems Design and Applications. Springer, Cham, pp. 880–890 (2020)
[19] Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Disc. 2(3), 283–304 (1998)
[20] MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, pp. 281–297 (1967)
[21] Huang, Z.: A fast clustering algorithm to cluster very large categorical data sets in data mining. Dmkd 3(8), 34–39 (1997)
[22] Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 2, 224–227 (1979)
[23] Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat. Theory Methods 3(1), 1–27 (1974)
[24] Xu, R., Wunsch, D.C.: Clustering algorithms in biomedical research: a review. IEEE Rev. Biomed. Eng. 3, 120–154 (2010)
[25] Clarke, K.R., Chapman, M.G., Somerfield, P.J., Needham, H.R.: Dispersion-based weighting of species counts in assemblage analyses. Mar. Ecol. Prog. Ser. 320, 11–27 (2006)
[26] Asuncion, A., Newman, D.: UCI machine learning repository (2007)
[27] Vanschoren, J., Van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. ACM SIGKDD Explor. Newsl. 15(2), 49–60 (2014)
[28] Guo, S., Dong, X.L., Srivastava, D., Zajac, R.: Record linkage with uniqueness constraints and erroneous values. Proc. VLDB Endow. 3(1–2), 417–428 (2010)

Author information

Authors and Affiliations

ISGT, LR99ES04 BESTMOD, Université de Tunis, Tunis, Tunisia
Siwar Gorrab, Fahmi Ben Rejab & Kaouther Nouira

Corresponding author

Correspondence to Siwar Gorrab.

Ethics declarations

Conflict of interest

The authors have no conflict of interest to declare.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study. This manuscript is the authors’ original work and has not been submitted simultaneously elsewhere. All authors have checked the manuscript and agreed to the submission.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gorrab, S., Ben Rejab, F. & Nouira, K. Split incremental clustering algorithm of mixed data stream. Prog Artif Intell 13, 51–64 (2024). https://doi.org/10.1007/s13748-024-00316-1

Received: 05 April 2022
Accepted: 21 February 2024
Published: 07 March 2024
Issue Date: March 2024
DOI: https://doi.org/10.1007/s13748-024-00316-1
