Multi-window based ensemble learning for classification of imbalanced streaming data

Abstract

Imbalanced streaming data is commonly encountered in real-world data mining and machine learning applications and has attracted much attention in recent years. In practice, class imbalance and streaming data often occur together; however, little research has addressed the two problems jointly. In this paper, we propose a multi-window based ensemble learning method for the classification of imbalanced streaming data. Three types of windows are defined to store the current batch of instances, the latest minority instances, and the ensemble classifier. The ensemble classifier consists of a set of the latest sub-classifiers together with the instances used to train each sub-classifier. All sub-classifiers are weighted prior to predicting the class labels of newly arriving instances, and new sub-classifiers are trained only when the precision is below a predefined threshold. Extensive experiments on synthetic and real-world datasets demonstrate that the new approach can efficiently and effectively classify imbalanced streaming data, and generally outperforms existing approaches.
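
The three-window idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names (`CentroidClassifier`, `MultiWindowEnsemble`), the nearest-centroid base learner, the window sizes, the use of minority-class precision, and the rule that weights each sub-classifier by its precision on the most recent labelled batch are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import deque


class CentroidClassifier:
    """Toy base learner (an assumption): nearest class centroid."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_one(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=sq_dist)


class MultiWindowEnsemble:
    """Hypothetical sketch of a multi-window ensemble for imbalanced streams."""

    def __init__(self, minority_label=1, minority_size=50,
                 ensemble_size=5, threshold=0.8):
        self.minority_label = minority_label
        # Window 2: the latest minority instances.
        self.minority_window = deque(maxlen=minority_size)
        # Window 3: latest sub-classifiers plus their training instances.
        self.ensemble_window = deque(maxlen=ensemble_size)
        self.weights = []
        self.threshold = threshold

    def _precision(self, preds, y):
        # Minority-class precision; 0.0 when nothing is predicted as minority.
        hits = [t for p, t in zip(preds, y) if p == self.minority_label]
        if not hits:
            return 0.0
        return sum(t == self.minority_label for t in hits) / len(hits)

    def predict_one(self, x):
        # Weighted vote over the sub-classifiers in the ensemble window.
        votes = {}
        for (clf, _), w in zip(self.ensemble_window, self.weights):
            label = clf.predict_one(x)
            votes[label] = votes.get(label, 0.0) + w
        return max(votes, key=votes.get)

    def process_batch(self, X, y):
        # Window 1: the current batch (X, y). Predict first, then learn.
        if self.ensemble_window:
            preds = [self.predict_one(x) for x in X]
        else:
            preds = [None] * len(X)
        precision = self._precision(preds, y)
        # Train a new sub-classifier only when precision is below threshold,
        # enriching the batch with the stored minority instances.
        if precision < self.threshold:
            train_X = list(X) + list(self.minority_window)
            train_y = list(y) + [self.minority_label] * len(self.minority_window)
            clf = CentroidClassifier().fit(train_X, train_y)
            self.ensemble_window.append((clf, (train_X, train_y)))
        # Assumed weighting rule: precision of each member on this batch.
        self.weights = [
            self._precision([clf.predict_one(x) for x in X], y) + 1e-9
            for clf, _ in self.ensemble_window
        ]
        self.minority_window.extend(
            x for x, t in zip(X, y) if t == self.minority_label
        )
        return preds, precision
```

Calling `process_batch` once per arriving batch returns the predictions made before the labels were used, along with the measured precision. Because the windows are bounded deques, the oldest minority instances and sub-classifiers are dropped automatically as new ones arrive.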

Notes

  1. http://moa.cs.waikato.ac.nz/datasets/

  2. http://archive.ics.uci.edu/ml/datasets.html

  3. http://moa.cms.waikato.ac.nz/

Acknowledgements

This work was supported by the ARC DP project (DP 130101327), the 973 Program (Grant Nos. 2013CB329601, 2013CB329602, 2013CB329604), and the 863 Program (Grant Nos. 2012AA01A401, 2012AA01A402).

Author information

Corresponding author

Correspondence to Hu Li.

About this article

Cite this article

Li, H., Wang, Y., Wang, H. et al. Multi-window based ensemble learning for classification of imbalanced streaming data. World Wide Web 20, 1507–1525 (2017). https://doi.org/10.1007/s11280-017-0449-x

  • DOI: https://doi.org/10.1007/s11280-017-0449-x

Keywords

  • Streaming data
  • Class imbalance
  • Multi-window
  • Ensemble learning