Abstract
Classifying imbalanced, largely unlabelled data streams that exhibit concept drift remains challenging. At high degrees of imbalance, learners perform poorly on the minority class, which causes drift detection to fail; the existing model is then not updated, and classifier performance degrades. Drift detection is typically supervised, and such detectors, although effective, are impractical because in real-world applications only a portion of the stream can be labelled: oracle assistance is costly and laborious. To alleviate these problems, this paper presents cluster-based active learning for class imbalance and concept drift (CBAL). Adaptive sampling strategies address high degrees of imbalance. A two-layer drift detection strategy is used, in which the first layer is unsupervised and the second is supervised. To reduce the labelling cost, the framework uses a clustering technique when querying labels. Extensive experiments on synthetic and real-world data streams show better classification performance, and CBAL detects drifts with fewer false alarms and less oracle intervention. In the highly imbalanced case (i.e., 10%), the performance of CBAL is 53% or higher, whereas the performance of the other algorithms is zero. CBAL also detects the number of drifts more accurately and reduces the labelling cost by 90%.
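The two-layer idea in the abstract can be illustrated with a minimal sketch. It is not the CBAL implementation: the abstract does not say which statistical tests or clustering method are used, so the z-score check on a single feature (layer 1) and the error-rate comparison on a handful of queried labels (layer 2) are stand-in assumptions, and every function name, parameter and threshold below is hypothetical.

```python
from statistics import mean, stdev
from typing import Callable, Sequence, Tuple


def unsupervised_alarm(reference: Sequence[float], current: Sequence[float],
                       z_threshold: float = 3.0) -> bool:
    """Layer 1 (no labels): flag a shift in the mean of one feature.

    Assumption: a simple z-score on the window mean stands in for
    whatever unsupervised test the paper actually uses.
    """
    spread = stdev(reference) or 1e-12  # avoid division by zero
    standard_error = spread / len(current) ** 0.5
    return abs(mean(current) - mean(reference)) / standard_error > z_threshold


def supervised_confirm(predictions: Sequence[int], queried_labels: Sequence[int],
                       baseline_error: float, tolerance: float = 0.1) -> bool:
    """Layer 2 (few labels): confirm drift only if the error rate rose."""
    errors = sum(p != y for p, y in zip(predictions, queried_labels))
    return errors / len(queried_labels) > baseline_error + tolerance


def two_layer_drift_check(reference: Sequence[float], current: Sequence[float],
                          predict: Callable[[float], int],
                          oracle: Callable[[float], int],
                          baseline_error: float,
                          n_queries: int = 5) -> Tuple[bool, int]:
    """Return (drift_confirmed, labels_requested).

    The oracle is consulted only after layer 1 raises an alarm, which is
    where the saving in labelling cost comes from in this sketch.
    """
    if not unsupervised_alarm(reference, current):
        return False, 0
    # Assumption: evenly spaced instances stand in for cluster-based selection.
    step = max(1, len(current) // n_queries)
    sample = list(current[::step])[:n_queries]
    predictions = [predict(x) for x in sample]
    labels = [oracle(x) for x in sample]
    return supervised_confirm(predictions, labels, baseline_error), len(sample)
```

In this sketch a stable window costs no labels at all, and a shift in the feature distribution that leaves the labels unchanged is rejected by the second layer instead of being reported as a drift, which is one way a supervised check can cut false alarms.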
Acknowledgements
This study was funded by India’s Defence Research and Development Organisation (DRDO) under sanction code ERIPR/GIA/17-18/038. The work was reviewed by the Centre for Artificial Intelligence and Robotics (CAIR). We would like to thank the late Dr. T. Maruthi Padmaja, the grant recipient, for her assistance and support in this work.
Author information
Contributions
Dirsumilli Himaja wrote the main manuscript text and prepared the figures and tables. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
About this article
Cite this article
Himaja, D., Dondeti, V., Uppalapati, S. et al. Cluster based active learning for classification of evolving streams. Evol. Intel. (2023). https://doi.org/10.1007/s12065-023-00879-3
DOI: https://doi.org/10.1007/s12065-023-00879-3