
A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework


Abstract

Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures and benchmarks on how to evaluate these algorithms. This work proposes a standardized, exhaustive, and comprehensive experimental framework to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, and real-world and semi-synthetic datasets in binary and multi-class scenarios. This results in a large-scale experimental comparison of state-of-the-art classifiers in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and provide general recommendations to end users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental framework is fully reproducible and easy to extend with new methods. In this way, we propose a standardized approach to conducting experiments in imbalanced data streams that other researchers can use to create complete, trustworthy, and fair evaluations of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.
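For readers who want a concrete sense of the test-then-train (prequential) evaluation protocol that studies of this kind rely on, the sketch below runs an incremental classifier over a synthetic stream with an imposed class imbalance and reports an imbalance-aware metric. It is a minimal illustration built on the open-source river library, not the authors' framework; the Hoeffding tree learner, the G-mean metric, and the roughly 1:10 imbalance ratio are assumptions made for this example only.

```python
# Minimal prequential (test-then-train) sketch on an artificially imbalanced
# binary stream, scored with an imbalance-aware metric. Illustrative only:
# the learner, metric, and 1:10 imbalance ratio are assumptions, not settings
# taken from the paper or its MOA-based framework.
import random

from river import metrics, tree
from river.datasets import synth

random.seed(42)

stream = synth.SEA(seed=42)               # balanced synthetic binary stream
model = tree.HoeffdingTreeClassifier()    # any incremental classifier works here
metric = metrics.GeometricMean()          # robust to skewed class distributions

kept = 0
for x, y in stream.take(20_000):
    # Impose a static ~1:10 class imbalance by dropping 90% of one class.
    if y and random.random() < 0.9:
        continue
    y_pred = model.predict_one(x)         # test first ...
    if y_pred is not None:                # (no prediction before the first update)
        metric.update(y, y_pred)
    model.learn_one(x, y)                 # ... then train on the same instance
    kept += 1

print(f"instances used: {kept}, prequential G-mean: {metric.get():.3f}")
```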




Data availability

Data and materials are available at https://github.com/canoalberto/imbalanced-streams.

Code availability

Source code is available at https://github.com/canoalberto/imbalanced-streams.

Notes

  1. Source code, experiments, and results are available at https://github.com/canoalberto/imbalanced-streams.

  2. Interactive plots and tables for all experiments are available at https://people.vcu.edu/~acano/imbalanced-streams.

  3. Complete results for all experiments are available at https://people.vcu.edu/~acano/imbalanced-streams.

  4. https://github.com/mlrep/imb-drift-20.


Acknowledgements

High Performance Computing resources provided by the High Performance Research Computing (HPRC) Core Facility at Virginia Commonwealth University (https://hprc.vcu.edu) were used for conducting the research reported in this work.

Funding

This research was partially supported by the 2018 VCU Presidential Research Quest Fund (Alberto Cano) and an Amazon AWS Machine Learning Research award (Alberto Cano & Bartosz Krawczyk).

Author information


Contributions

Gabriel Aguiar contributed to the manuscript preparation. Alberto Cano contributed to the experimental evaluation and manuscript preparation. Bartosz Krawczyk contributed to the manuscript preparation. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Alberto Cano.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Editors: Nuno Moniz, Paula Branco, Luís Torgo, Nathalie Japkowicz, Michal Wozniak, Shuo Wang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Aguiar, G., Krawczyk, B. & Cano, A. A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework. Mach Learn 113, 4165–4243 (2024). https://doi.org/10.1007/s10994-023-06353-6
