Data Mining and Knowledge Discovery, Volume 31, Issue 5, pp 1242–1265

Classification of high-dimensional evolving data streams via a resource-efficient online ensemble

  • Tingting Zhai
  • Yang Gao
  • Hao Wang
  • Longbing Cao
Article
Part of the following topical collections:
  1. Journal Track of ECML PKDD 2017

Abstract

A novel online ensemble strategy, ensemble BPegasos (EBPegasos), is proposed to simultaneously address the problems caused by concept drift and the curse of dimensionality in classifying high-dimensional evolving data streams, a combination that has not been addressed in the literature. First, EBPegasos uses BPegasos, an online kernelized SVM-based algorithm, as the component classifier to handle the scalability and sparsity of high-dimensional data. Second, EBPegasos takes full advantage of the characteristics of BPegasos to cope with various types of concept drift. Specifically, EBPegasos constructs diverse component classifiers by controlling the budget size of BPegasos; it also equips each component with a drift detector that monitors and evaluates its performance, and it modifies the ensemble structure only when large performance degradation occurs. This conditional structural-modification strategy enables EBPegasos to strike a good balance between exploiting and forgetting old knowledge. Finally, experiments first show that EBPegasos is more effective and resource-efficient than tree-based ensembles on high-dimensional data; comprehensive experiments on synthetic and real-life datasets then show that EBPegasos copes with various types of concept drift significantly better than state-of-the-art ensemble frameworks when all ensembles use BPegasos as the base learner.
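The abstract combines three ingredients: budgeted online kernel SVM components trained with Pegasos-style updates, diversity obtained by varying the budget size, and a per-component drift detector that triggers structural changes only on large performance degradation. The sketch below illustrates these ideas in Python; all class names, the |alpha|-based budget maintenance, and the DDM-style detector are illustrative assumptions for the sketch, not the authors' exact BPegasos/EBPegasos procedures.

```python
# Minimal sketch of an online ensemble of budgeted kernel SVMs with per-component
# drift detectors, under simplifying assumptions (not the paper's exact algorithm).
import numpy as np


class BudgetedKernelPegasos:
    """Online kernel SVM trained with Pegasos-style SGD under a support-vector budget."""

    def __init__(self, budget=50, lam=1e-3, gamma=0.1):
        self.budget, self.lam, self.gamma = budget, lam, gamma
        self.sv, self.alpha, self.t = [], [], 0

    def _kernel(self, x, z):
        return np.exp(-self.gamma * np.sum((x - z) ** 2))      # RBF kernel

    def decision(self, x):
        return sum(a * self._kernel(s, x) for s, a in zip(self.sv, self.alpha))

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1

    def partial_fit(self, x, y):
        self.t += 1
        eta = 1.0 / (self.lam * self.t)                          # Pegasos step size
        margin = y * self.decision(x)
        self.alpha = [(1 - eta * self.lam) * a for a in self.alpha]   # shrink step
        if margin < 1:                                           # hinge-loss violation
            self.sv.append(x)
            self.alpha.append(eta * y)
            if len(self.sv) > self.budget:                       # simplified budget maintenance
                j = int(np.argmin(np.abs(self.alpha)))
                del self.sv[j], self.alpha[j]


class SimpleDriftDetector:
    """DDM-style monitor: flag drift when the error rate rises well above its minimum."""

    def __init__(self):
        self.n, self.err, self.p_min, self.s_min = 0, 0, float("inf"), float("inf")

    def update(self, mistake):
        self.n += 1
        self.err += int(mistake)
        p = self.err / self.n
        s = np.sqrt(p * (1 - p) / self.n)
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        return self.n > 30 and p + s > self.p_min + 3 * self.s_min


class OnlineEnsemble:
    """Vote over components with diverse budgets; rebuild a component only on drift."""

    def __init__(self, budgets=(20, 50, 100)):
        self.budgets = budgets
        self.members = [BudgetedKernelPegasos(budget=b) for b in budgets]
        self.detectors = [SimpleDriftDetector() for _ in budgets]

    def predict(self, x):
        votes = sum(m.predict(x) for m in self.members)
        return 1 if votes >= 0 else -1

    def partial_fit(self, x, y):
        for i, (m, d) in enumerate(zip(self.members, self.detectors)):
            mistake = m.predict(x) != y
            if d.update(mistake):                    # large degradation: replace component
                self.members[i] = BudgetedKernelPegasos(budget=self.budgets[i])
                self.detectors[i] = SimpleDriftDetector()
            self.members[i].partial_fit(x, y)


# Usage: prequential (test-then-train) loop on a toy stream with drift at t = 500.
rng = np.random.default_rng(0)
ens, correct, n_steps = OnlineEnsemble(), 0, 1000
for t in range(n_steps):
    x = rng.normal(size=10)
    y = 1 if x[0] + (0.0 if t < 500 else 2.0) * x[1] > 0 else -1
    correct += int(ens.predict(x) == y)
    ens.partial_fit(x, y)
print("prequential accuracy:", correct / n_steps)
```

In the prequential loop, each example first evaluates every component (feeding its detector) and then updates it; a component is rebuilt only when its detector fires, so old knowledge is kept as long as the component keeps predicting well.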

Keywords

High dimensionality · Concept drift · Data stream classification · Online ensemble

Notes

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 61432008, 61503178), the Natural Science Foundation and Primary R&D Plan of Jiangsu Province, China (Nos. BE2015213, BK20150587), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Copyright information

© The Author(s) 2017

Authors and Affiliations

  1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
  2. Advanced Analytics Institute, University of Technology Sydney, Sydney, Australia