
A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework


Abstract

Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures and benchmarks on how to evaluate these algorithms. This work proposes a standardized, exhaustive, and comprehensive experimental framework to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, and real-world and semi-synthetic datasets in binary and multi-class scenarios. This results in a large-scale experimental comparison of state-of-the-art classifiers in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and provide general recommendations to end users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental framework is fully reproducible and easy to extend with new methods. In this way, we propose a standardized approach to conducting experiments in imbalanced data streams that other researchers can use to create complete, trustworthy, and fair evaluations of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.
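For readers who want a concrete sense of the test-then-train (prequential) evaluation protocol that studies of this kind rely on, the sketch below runs an incremental classifier over a synthetic stream with an imposed class imbalance and reports an imbalance-aware metric. It is a minimal illustration built on the open-source river library, not the authors' framework; the Hoeffding tree learner, the G-mean metric, and the roughly 1:10 imbalance ratio are assumptions made for this example only.

```python
# Minimal prequential (test-then-train) sketch on an artificially imbalanced
# binary stream, scored with an imbalance-aware metric. Illustrative only:
# the learner, metric, and 1:10 imbalance ratio are assumptions, not settings
# taken from the paper or its MOA-based framework.
import random

from river import metrics, tree
from river.datasets import synth

random.seed(42)

stream = synth.SEA(seed=42)               # balanced synthetic binary stream
model = tree.HoeffdingTreeClassifier()    # any incremental classifier works here
metric = metrics.GeometricMean()          # robust to skewed class distributions

kept = 0
for x, y in stream.take(20_000):
    # Impose a static ~1:10 class imbalance by dropping 90% of one class.
    if y and random.random() < 0.9:
        continue
    y_pred = model.predict_one(x)         # test first ...
    if y_pred is not None:                # (no prediction before the first update)
        metric.update(y, y_pred)
    model.learn_one(x, y)                 # ... then train on the same instance
    kept += 1

print(f"instances used: {kept}, prequential G-mean: {metric.get():.3f}")
```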




Data availability

Data and materials are available at https://github.com/canoalberto/imbalanced-streams.

Code availability

Source code is available at https://github.com/canoalberto/imbalanced-streams.

Notes

  1. Source code, experiments, and results are available at https://github.com/canoalberto/imbalanced-streams.

  2. Interactive plots and tables for all experiments are available at https://people.vcu.edu/~acano/imbalanced-streams.

  3. Complete results for all experiments are available at https://people.vcu.edu/~acano/imbalanced-streams.

  4. https://github.com/mlrep/imb-drift-20.


Acknowledgements

High Performance Computing resources provided by the High Performance Research Computing (HPRC) Core Facility at Virginia Commonwealth University (https://hprc.vcu.edu) were used for conducting the research reported in this work.

Funding

This research was partially supported by the 2018 VCU Presidential Research Quest Fund (Alberto Cano) and an Amazon AWS Machine Learning Research award (Alberto Cano & Bartosz Krawczyk).

Author information


Contributions

Gabriel Aguiar contributed to the manuscript preparation. Alberto Cano contributed to the experimental evaluation and manuscript preparation. Bartosz Krawczyk contributed to the manuscript preparation. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Alberto Cano.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Editors: Nuno Moniz, Paula Branco, Luís Torgo, Nathalie Japkowicz, Michal Wozniak, Shuo Wang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Aguiar, G., Krawczyk, B. & Cano, A. A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework. Mach Learn 113, 4165–4243 (2024). https://doi.org/10.1007/s10994-023-06353-6
