Data Mining and Knowledge Discovery, Volume 32, Issue 5, pp 1179–1199

Analyzing concept drift and shift from sample data

  • Geoffrey I. Webb
  • Loong Kuan Lee
  • Bart Goethals
  • François Petitjean
Article
Part of the following topical collections:
  1. Journal Track of ECML PKDD 2018

Abstract

Concept drift and shift are major issues that greatly affect the accuracy and reliability of many real-world applications of machine learning. We propose a new data mining task, concept drift mapping—the description and analysis of instances of concept drift or shift. We argue that concept drift mapping is an essential prerequisite for tackling concept drift and shift. We propose tools for this purpose, arguing for the importance of quantitative descriptions of drift and shift in marginal distributions. We present quantitative concept drift mapping techniques, along with methods for visualizing their results. We illustrate their effectiveness for real-world applications across energy-pricing, vegetation monitoring and airline scheduling.

Keywords

Concept drift; Concept shift; Non-stationary distribution; Visualisation; Mapping

Notes

Acknowledgements

This work was supported by the Australian Research Council under Awards DP140100087 and DE170100037. This material is based upon work supported by the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Development (AOARD) under Award Number FA2386-17-1-4033. The authors would like to thank the colleagues from CESBIO (Jordi Inglada, Arthur Vincent, Marcela Arias, Benjamin Tardy, David Morin and Isabel Rodes) for providing the Satellite dataset (data and labels).

Copyright information

© The Author(s) 2018

Authors and Affiliations

  • Geoffrey I. Webb (1)
  • Loong Kuan Lee (1)
  • Bart Goethals (1, 2)
  • François Petitjean (1)

  1. Faculty of Information Technology, Monash University, Clayton, Australia
  2. Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
