Statistics and Computing, Volume 26, Issue 5, pp 1101–1120

Divisive clustering of high dimensional data streams

  • David P. Hofmeyr
  • Nicos G. Pavlidis
  • Idris A. Eckley
Article

Abstract

Clustering streaming data is gaining importance as automatic data acquisition technologies are deployed in diverse applications. We propose a fully incremental projected divisive clustering method for high-dimensional data streams that is motivated by high-density clustering. The method is capable of identifying clusters in arbitrary subspaces, estimating the number of clusters, and detecting changes in the data distribution that necessitate a revision of the model. An empirical evaluation of the proposed method on numerous real and simulated datasets shows that it scales with dimension and number of clusters, is robust to noisy and irrelevant features, and handles various types of non-stationarity.

Keywords

  • Clustering
  • Data stream
  • High dimensionality
  • Population drift
  • Modality testing

Notes

Acknowledgments

David Hofmeyr gratefully acknowledges funding from both the Engineering and Physical Sciences Research Council (EPSRC) and the Oppenheimer Memorial Trust.

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • David P. Hofmeyr (1)
  • Nicos G. Pavlidis (2)
  • Idris A. Eckley (1)
  1. Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
  2. Department of Management Science, Lancaster University, Lancaster, UK
