Unsupervised Change Analysis Using Supervised Learning

  • Shohei Hido
  • Tsuyoshi Idé
  • Hisashi Kashima
  • Harunobu Kubo
  • Hirofumi Matsuzawa
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5012)

Abstract

We formulate a new problem, which we call change analysis, and propose a novel method for solving it. In contrast to existing methods of change (or outlier) detection, change analysis goes beyond detecting whether any changes exist: its ultimate goal is to explain the changes. Although change analysis is by nature an unsupervised learning problem, we propose a novel approach based on supervised learning. The key idea is to use a supervised classifier to interpret the changes: a classifier should be able to discriminate between the two data sets if they actually come from two different data sources. In other words, we train the supervised learner on hypothetical labels and exploit the learned model to interpret the change. Experimental results on real data show that the proposed approach is promising for change analysis as well as concept drift analysis.
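The key idea can be illustrated with a minimal sketch. It is not the authors' implementation: the one-feature decision stump, the half-and-half train/test split, the synthetic Gaussian data, and all function names (`best_stump`, `stump_accuracy`, `analyze_change`, `make_data`) are assumptions chosen for illustration. Rows from the first data set get the hypothetical label 0 and rows from the second get 1. Held-out accuracy well above 0.5 suggests that the two sets differ, and the feature and threshold the stump picks give a rough explanation of where they differ.

```python
import random

def best_stump(rows, labels):
    """Return (feature, threshold, polarity, train_accuracy) of the best one-feature split."""
    n = len(rows)
    best = (0, 0.0, 1, 0.0)
    for f in range(len(rows[0])):
        for t in {r[f] for r in rows}:
            # Polarity 1 predicts label 1 when value > t; polarity 0 is the reverse.
            hits = sum((r[f] > t) == (y == 1) for r, y in zip(rows, labels))
            for polarity, correct in ((1, hits), (0, n - hits)):
                if correct / n > best[3]:
                    best = (f, t, polarity, correct / n)
    return best

def stump_accuracy(stump, rows, labels):
    """Fraction of rows whose label the stump predicts correctly."""
    f, t, polarity, _ = stump
    hits = sum(((r[f] > t) == (polarity == 1)) == (y == 1)
               for r, y in zip(rows, labels))
    return hits / len(rows)

def analyze_change(data_a, data_b, seed=0):
    """Label data_a as 0 and data_b as 1, fit on one half, score on the other."""
    rows = data_a + data_b
    labels = [0] * len(data_a) + [1] * len(data_b)  # hypothetical labels
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    half = len(order) // 2
    train, test = order[:half], order[half:]
    stump = best_stump([rows[i] for i in train], [labels[i] for i in train])
    acc = stump_accuracy(stump, [rows[i] for i in test], [labels[i] for i in test])
    return {"feature": stump[0], "threshold": stump[1], "held_out_accuracy": acc}

def make_data(n, shift, rng):
    """Synthetic rows with three Gaussian features; only feature 1 is shifted."""
    return [[rng.gauss(0, 1), rng.gauss(shift, 1), rng.gauss(0, 1)] for _ in range(n)]

rng = random.Random(0)
before = make_data(200, 0.0, rng)
after = make_data(200, 3.0, rng)   # feature 1 has drifted
print(analyze_change(before, after))
```

In this toy setting the stump should select feature 1, since that is the only feature whose distribution differs between the two sets. A richer classifier, such as a decision tree, would give richer explanations, and a significance test on the held-out accuracy would be needed before claiming that a change is real.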

Keywords

change analysis, two-sample test, concept drift

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Shohei Hido (1)
  • Tsuyoshi Idé (1)
  • Hisashi Kashima (1)
  • Harunobu Kubo (1)
  • Hirofumi Matsuzawa (1)

  1. IBM Research, Tokyo Research Laboratory, Kanagawa, Japan
