Efficient Data Mining by Active Learning

  • Hiroshi Mamitsuka
  • Naoki Abe
Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2281)

Abstract

Data scalability is an important issue in data mining and knowledge discovery. We approach this problem by applying active learning as a method for data selection. In particular, we propose and evaluate a selective sampling method in the general category of ‘uncertainty sampling,’ obtained by adopting and extending ‘query by bagging,’ a query learning method proposed earlier by the authors. We evaluate the proposed method empirically by comparing its performance against Breiman’s Ivotes, a representative sampling method for scaling up inductive algorithms. Our results show that the proposed method compares favorably with Ivotes, both in the predictive accuracy achieved within a fixed amount of computation time and in the final accuracy achieved. The advantage is especially clear when the data size approaches a million, a typical data size in real-world data mining applications. We also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced at higher noise levels.
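The selective sampling idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything concrete here is an assumption made for illustration: the one-dimensional threshold learner (`train_stump`), the committee size, the batch size, the number of rounds, and all function names are hypothetical and are not taken from the chapter. The sketch only shows the general pattern: train a committee on bootstrap resamples of the data selected so far, then add the pool examples on which the committee's votes are most evenly split.

```python
import random


def train_stump(sample):
    """Fit a 1-D threshold rule on (x, y) pairs with y in {0, 1}.

    Illustrative base learner only; any inductive algorithm could be used.
    """
    best = None
    for threshold, _ in sample:
        for above in (0, 1):
            errors = sum(
                (above if x > threshold else 1 - above) != y for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x > threshold else 1 - above


def bagging_committee(labeled, n_models, rng):
    """Train one model per bootstrap resample of the selected data."""
    return [
        train_stump([rng.choice(labeled) for _ in labeled])
        for _ in range(n_models)
    ]


def vote_margin(committee, x):
    """Gap between the two classes' vote shares (0 means an even split)."""
    ones = sum(model(x) for model in committee)
    return abs(2 * ones - len(committee)) / len(committee)


def rank_by_uncertainty(committee, pool):
    """Order pool examples from most to least committee disagreement."""
    return sorted(pool, key=lambda example: vote_margin(committee, example[0]))


def predict(committee, x):
    """Majority vote of the committee."""
    return int(2 * sum(model(x) for model in committee) > len(committee))


def selective_sampling(data, n_rounds=5, batch=10, n_models=11, seed=0):
    """Grow the training set by repeatedly adding the most uncertain examples.

    All parameter defaults are arbitrary choices for this sketch.
    """
    rng = random.Random(seed)
    pool = list(data)
    rng.shuffle(pool)
    labeled, pool = pool[:batch], pool[batch:]  # small random starting sample
    for _ in range(n_rounds):
        committee = bagging_committee(labeled, n_models, rng)
        ranked = rank_by_uncertainty(committee, pool)
        labeled.extend(ranked[:batch])
        pool = ranked[batch:]
    return labeled, bagging_committee(labeled, n_models, rng)


if __name__ == "__main__":
    # Toy data: the label is 1 exactly when x >= 0.5.
    data = [(i / 100, int(i >= 50)) for i in range(100)]
    labeled, committee = selective_sampling(data)
    accuracy = sum(predict(committee, x) == y for x, y in data) / len(data)
    print(len(labeled), accuracy)
```

In this sketch the learner only ever trains on the selected subset, which is the sense in which selection is meant to reduce the amount of data processed. The comparison with Ivotes and the noise experiments described in the abstract are not reproduced here.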

References

  1. AM98.
    N. Abe and H. Mamitsuka. Query Learning Strategies Using Boosting and Bagging. Proceedings of Fifteenth International Conference on Machine Learning, 1–9, 1998.
  2. AIS93.
    R. Agrawal and T. Imielinski and A. Swami. Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914–925, 1993.
  3. Breiman96.
    L. Breiman. Bagging Predictors. Machine Learning 24:123–140, 1996.
  4. Breiman99.
    L. Breiman. Pasting Small Votes for Classification in Large Databases and On-line. Machine Learning 36:85–103, 1999.
  5. C91.
    J. Catlett. Megainduction: A test flight. Proceedings of Eighth International Workshop on Machine Learning, 596–599, 1991.
  6. FS97.
    Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1):119–139, 1997.
  7. F98.
    J. Fürnkranz. Integrative windowing. Journal of Artificial Intelligence Research 8:129–164, 1998.
  8. GGRL99.
    J. Gehrke and V. Ganti and R. Ramakrishnan and W-Y. Loh. BOAT - Optimistic Decision Tree Construction. Proceedings of the ACM SIGMOD International Conference on Management of Data, 169–180, 1999.
  9. MST94.
    D. Michie and D. Spiegelhalter and C. Taylor (Editors). Machine Learning, Neural and Statistical Classification. Ellis Horwood, London, 1994.
  10. PK99.
    F. Provost and V. Kolluri. A Survey of Methods for Scaling Up Inductive Algorithms. Knowledge Discovery and Data Mining 3(2):131–169, 1999.
  11. Q83.
    J. R. Quinlan. Learning efficient classification procedures and their applications to chess endgames. In Machine Learning: An artificial intelligence approach, R. S. Michalski and J. G. Carbonell and T. M. Mitchell (Editors), San Francisco, Morgan Kaufmann, 1983.
  12. Q93.
    J. R. Quinlan. C4.5: Programs for Machine Learning. San Francisco, Morgan Kaufmann, 1993.
  13. RS98.
    R. Rastogi and K. Shim. Public: A Decision Tree Classifier that integrates building and pruning. Proceedings of 24th International Conference on Very Large Data Bases, New York, Morgan Kaufmann, 404–415, 1998.
  14. SOS92.
    H. S. Seung and M. Opper and H. Sompolinsky. Query by committee. Proceedings of 5th Annual Workshop on Computational Learning Theory, 287–294, New York, ACM Press, 1992.
  15. WI98.
    S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, San Francisco, 1998.

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Hiroshi Mamitsuka (1)
  • Naoki Abe (2)
  1. Computational Engineering Technology Group, Computer and Communications Research, NEC Corporation, Japan
  2. Mathematical Sciences Department, IBM Thomas J. Watson Research Center, Japan
