Clustering in the Presence of Concept Drift

  • Richard Hugh Moulton (Email author)
  • Herna L. Viktor
  • Nathalie Japkowicz
  • João Gama
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11051)

Abstract

Clustering naturally addresses many of the challenges of data streams, and many data stream clustering algorithms (DSCAs) have been proposed. The literature does not, however, provide quantitative descriptions of how these algorithms behave in different circumstances. In this paper we study how the clusterings produced by different DSCAs change, relative to the ground truth, as quantitatively different types of concept drift are encountered. This paper makes two contributions to the literature. First, we propose a method for generating real-valued data streams with precise quantitative concept drift. Second, we conduct an experimental study to provide quantitative analyses of DSCA performance with synthetic real-valued data streams and show how to apply this knowledge to real-world data streams. We find that large-magnitude and short-duration concept drifts are most challenging and that DSCAs with partitioning-based offline clustering methods are generally more robust than those with density-based offline clustering methods. Our results further indicate that increasing the number of classes present in a stream creates a more challenging environment than decreasing the number of classes. Code related to this paper is available at: https://doi.org/10.5281/zenodo.1168699, https://doi.org/10.5281/zenodo.1216189, https://doi.org/10.5281/zenodo.1213802, https://doi.org/10.5281/zenodo.1304380.

Keywords

Data streams, Clustering, Concept drift

Supplementary material

Supplementary material 1: 478880_1_En_21_MOESM1_ESM.pdf (PDF, 402 KB)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Richard Hugh Moulton (1, Email author)
  • Herna L. Viktor (1)
  • Nathalie Japkowicz (2)
  • João Gama (3)

  1. School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada
  2. Department of Computer Science, American University, Washington DC, USA
  3. Faculty of Economics, University of Porto, Porto, Portugal
