Analyzing concept drift and shift from sample data

Abstract

Concept drift and shift are major issues that greatly affect the accuracy and reliability of many real-world applications of machine learning. We propose a new data mining task, concept drift mapping: the description and analysis of instances of concept drift or shift. We argue that concept drift mapping is an essential prerequisite for tackling concept drift and shift. We propose tools for this purpose, arguing for the importance of quantitative descriptions of drift and shift in marginal distributions. We present quantitative concept drift mapping techniques, along with methods for visualizing their results. We illustrate their effectiveness for real-world applications across energy pricing, vegetation monitoring and airline scheduling.
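As a rough illustration of what a quantitative description of drift in marginal distributions might look like, the following minimal sketch estimates the total variation distance between two time periods separately for each attribute. It assumes categorical attributes, non-empty samples, and raw empirical frequencies as probability estimates; the function names `total_variation` and `marginal_drift_map` and the toy records are hypothetical and are not taken from the article.

```python
from collections import Counter


def total_variation(sample_t, sample_u):
    """Total variation distance between the empirical distributions of two
    non-empty samples of categorical values."""
    n_t, n_u = len(sample_t), len(sample_u)
    freq_t, freq_u = Counter(sample_t), Counter(sample_u)
    support = set(freq_t) | set(freq_u)
    # Counter returns 0 for values that never occur in a sample.
    return 0.5 * sum(abs(freq_t[v] / n_t - freq_u[v] / n_u) for v in support)


def marginal_drift_map(rows_t, rows_u, attributes):
    """Map each attribute name to the drift magnitude of its marginal
    distribution between period t and period u."""
    return {
        name: total_variation([r[name] for r in rows_t],
                              [r[name] for r in rows_u])
        for name in attributes
    }


# Hypothetical toy data: categorical records sampled in two periods.
period_t = [
    {"day": "weekday", "demand": "low"},
    {"day": "weekday", "demand": "high"},
    {"day": "weekend", "demand": "low"},
    {"day": "weekend", "demand": "low"},
]
period_u = [
    {"day": "weekday", "demand": "high"},
    {"day": "weekday", "demand": "high"},
    {"day": "weekend", "demand": "high"},
    {"day": "weekend", "demand": "low"},
]

drift_map = marginal_drift_map(period_t, period_u, ["day", "demand"])
for name, magnitude in drift_map.items():
    print(f"{name}: {magnitude:.3f}")
```

In this toy data the marginal distribution of `day` is identical in both periods while that of `demand` changes, so a per-attribute map of this kind points to where the drift is concentrated. Frequencies estimated from small samples are noisy, so the values should be read as estimates.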



Acknowledgements

This work was supported by the Australian Research Council under Awards DP140100087 and DE170100037. This material is based upon work supported by the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Development (AOARD) under Award Number FA2386-17-1-4033. The authors would like to thank the colleagues from CESBIO (Jordi Inglada, Arthur Vincent, Marcela Arias, Benjamin Tardy, David Morin and Isabel Rodes) for providing the Satellite dataset (data and labels).

Author information

Corresponding author

Correspondence to Geoffrey I. Webb.

Additional information

Responsible editor: Jesse Davis, Elisa Fromont, Derek Greene, and Bjorn Bringmann.

Proof that drift magnitude is monotone under increasing dimensionality

Here we prove that total variation distance is monotone under increasing dimensionality: adding covariates never decreases the distance. The proof generalizes trivially to Hellinger distance. Note that where one set of variables is conditioned on another, this monotone increase in distance applies to the dimensionality of the conditioned variables, not that of the conditioning variables.

Let X and Z be sets of covariates.

$$\begin{aligned}
\sigma_{t,u}(X) &\le \sigma_{t,u}(X,Z) \\
&\Updownarrow \\
\frac{1}{2}\sum_{\bar{x}\in \mathrm{dom}(X)}\left| P_t(\bar{x})-P_u(\bar{x})\right| &\le \frac{1}{2}\sum_{\substack{\bar{x}\in \mathrm{dom}(X) \\ \bar{z}\in \mathrm{dom}(Z)}}\left| P_t(\bar{x},\bar{z})-P_u(\bar{x},\bar{z})\right| \\
&\Updownarrow \\
\sum_{\bar{x}\in \mathrm{dom}(X)}\left| \sum_{\bar{z}\in \mathrm{dom}(Z)} P_t(\bar{x},\bar{z})- \sum_{\bar{z}\in \mathrm{dom}(Z)} P_u(\bar{x},\bar{z})\right| &\le \sum_{\bar{x}\in \mathrm{dom}(X)}\sum_{\bar{z}\in \mathrm{dom}(Z)}\left| P_t(\bar{x},\bar{z})-P_u(\bar{x},\bar{z})\right| \\
&\Updownarrow \\
\sum_{\bar{x}\in \mathrm{dom}(X)}\left| \sum_{\bar{z}\in \mathrm{dom}(Z)} \bigl(P_t(\bar{x},\bar{z})-P_u(\bar{x},\bar{z})\bigr)\right| &\le \sum_{\bar{x}\in \mathrm{dom}(X)}\sum_{\bar{z}\in \mathrm{dom}(Z)}\left| P_t(\bar{x},\bar{z})-P_u(\bar{x},\bar{z})\right|
\end{aligned}$$

The first equivalence expands the definition of total variation distance, the second writes each marginal probability over X as the sum of the joint probabilities over dom(Z), and the third merges the two inner sums. The final inequality holds by the triangle inequality, applied to the inner sum for each value of X.
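The monotonicity result can also be checked numerically. The sketch below draws random pairs of joint distributions over small discrete domains for X and Z and counts the cases in which the distance over X alone exceeds the distance over X and Z together. It is an illustrative check and not part of the proof; the helper names, the domain sizes and the Hellinger normalization (with the factor 1/2 inside the square root, one common convention) are assumptions.

```python
import itertools
import math
import random


def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def hellinger(p, q):
    """Hellinger distance, assuming the normalization that bounds it by 1."""
    keys = set(p) | set(q)
    squared = 0.5 * sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(squared)


def random_joint(dom_x, dom_z, rng):
    """Random joint distribution P(x, z) over dom_x x dom_z."""
    weights = {(x, z): rng.random() for x, z in itertools.product(dom_x, dom_z)}
    total = sum(weights.values())
    return {xz: w / total for xz, w in weights.items()}


def marginal_x(joint):
    """Marginalize Z out of a joint distribution keyed by (x, z)."""
    marginal = {}
    for (x, _z), prob in joint.items():
        marginal[x] = marginal.get(x, 0.0) + prob
    return marginal


rng = random.Random(0)  # fixed seed so the check is reproducible
dom_x, dom_z = range(3), range(4)  # hypothetical small domains

violations = 0
for _ in range(1000):
    p_t = random_joint(dom_x, dom_z, rng)
    p_u = random_joint(dom_x, dom_z, rng)
    for distance in (total_variation, hellinger):
        low_dim = distance(marginal_x(p_t), marginal_x(p_u))
        high_dim = distance(p_t, p_u)
        # Small tolerance guards against floating-point rounding.
        if low_dim > high_dim + 1e-12:
            violations += 1

print("violations:", violations)
```

If the result holds, the violation count should remain zero for both distances. A random check of this kind can only fail to find a counterexample; the derivation above is what establishes the inequality.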

About this article

Cite this article

Webb, G.I., Lee, L.K., Goethals, B. et al. Analyzing concept drift and shift from sample data. Data Min Knowl Disc 32, 1179–1199 (2018). https://doi.org/10.1007/s10618-018-0554-1

Keywords

  • Concept drift
  • Concept shift
  • Non-stationary distribution
  • Visualisation
  • Mapping