Efficient Data Mining by Active Learning
An important issue in data mining and knowledge discovery is the issue of data scalability. We propose an approach to this problem by applying active learning as a method for data selection. In particular, we propose and evaluate a selective sampling method that belongs to the general category of ‘uncertainty sampling,’ by adopting and extending the ‘query by bagging’ method, proposed earlier by the authors as a query learning method. We empirically evaluate the effectiveness of the proposed method by comparing its performance against Breiman’s Ivotes, a representative sampling method for scaling up inductive algorithms. Our results show that the performance of the proposed method compares favorably against that of Ivotes, both in terms of the predictive accuracy achieved using a fixed amount of computation time, and the final accuracy achieved. This is found to be especially the case when the data size approaches a million, a typical data size encountered in real world data mining applications. We have also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced for larger noise levels.
KeywordsData Mining Active Learn Predictive Accuracy Importance Sampling Inductive Algorithm
Unable to display preview. Download preview PDF.
- AM98.N. Abe and H. Mamitsuka. Query Learning Strategies Using Boosting and Bagging Proceedings of Fifteenth International Conference on Machine Learning, 1–9, 1998.Google Scholar
- C91.J. Catlett. Megainduction: Atest flight Proceedings of Eighth International Workshop on Machine Learning, 596–599, 1991.Google Scholar
- F98.J. Furnkranz Integrative windowing Journal of Artificial Intelligence Research 8:129–164, 1998.Google Scholar
- GGRL99.J. Gehrke and V. Ganti and R. Ramakrishnan and W-Y. Loh BOAT — Optimistic Decision Tree Construction Proceedings of the ACM SIGMOD International Conference on Management of Data, 169–180, 1999.Google Scholar
- Q83.J. R. Quinlan. Learning efficient classification procedures and their applications to chess endgames Machine Learning: An artificial intelligence approach, R. S. Michalski and J. G. Carbonell and T. M. Mitchell (Editors), San Francisco, Morgan Kaufmann, 1983.Google Scholar
- Q93.J. R. Quinlan C4.5: Programs for Machine Learning, San Francisco, Morgan Kaufmann, 1993.Google Scholar
- RS98.R. Rastogi and K. Shim Public: ADecision Tree Classifier that integrates building and pruning Proceedings of 24th International Conference on Very Large Data Bases, New York, Morgan Kaufmann, 404–415, 1998.Google Scholar