Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms
 Carlos Domingo,
 Ricard Gavaldà,
 Osamu Watanabe
Abstract
Scalability is a key requirement for any KDD and data mining algorithm, and one of the biggest research challenges is to develop methods that make it possible to use large amounts of data. One possible approach for dealing with huge amounts of data is to take a random sample and do data mining on it, since approximate answers are acceptable for many data mining applications. However, as argued by several researchers, random sampling is hard to use because an appropriate sample size is difficult to determine. In this paper, we take a sequential sampling approach to address this difficulty, and propose an adaptive sampling algorithm that solves a general problem covering many problems arising in applications of discovery science. The algorithm obtains examples sequentially in an online fashion, and it determines from the examples obtained so far whether it has already seen enough of them. Thus, the sample size is not fixed a priori; instead, it adapts to the situation. Because of this adaptiveness, whenever we are not in a worst-case situation, as is fortunately true of many practical applications, we can solve the problem with far fewer examples than the worst case requires. To illustrate the generality of our approach, we also describe how different instantiations of it can be applied to scale up knowledge discovery problems that appear in several areas.
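The sequential idea described in the abstract can be illustrated with a minimal sketch. This is not the algorithm from the paper; it is a generic stopping rule built on the Hoeffding bound, and the names `adaptive_estimate`, `draw`, and `max_samples` are hypothetical. It estimates the probability p that a drawn example satisfies some property, to within relative error `epsilon` and with confidence at least 1 - `delta`, and it stops as soon as the examples seen so far show that the sample is large enough.

```python
import math
from typing import Callable, Tuple


def adaptive_estimate(
    draw: Callable[[], bool],
    epsilon: float,
    delta: float,
    max_samples: int = 1_000_000,
) -> Tuple[float, int]:
    """Estimate p = Pr[draw() is True] to relative error epsilon.

    Returns (estimate, number of examples used). The sample size is not
    fixed in advance: sampling stops once the running estimate is large
    compared with its own confidence radius.
    """
    hits = 0
    for t in range(1, max_samples + 1):
        hits += 1 if draw() else 0
        p_hat = hits / t
        # Hoeffding radius at step t. The extra 2*t*(t+1) factor is a union
        # bound, so all steps hold simultaneously with probability >= 1 - delta.
        alpha = math.sqrt(math.log(2 * t * (t + 1) / delta) / (2 * t))
        # If |p_hat - p| <= alpha and alpha <= epsilon * (p_hat - alpha),
        # then |p_hat - p| <= epsilon * p, so it is safe to stop.
        if p_hat >= alpha * (1 + 1 / epsilon):
            return p_hat, t
    # Safety cap (an assumption of this sketch): give up and return the
    # best estimate so far, without the accuracy guarantee.
    return hits / max_samples, max_samples
```

For example, calling `adaptive_estimate(lambda: rng.random() < 0.9, 0.1, 0.05)` with `rng = random.Random(1)` stops much earlier than the same call with a threshold of 0.5, because a larger p clears the stopping condition with a wider confidence radius. A fixed sample size chosen in advance would have to be sized for the smallest p one expects.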
 Book Title
 Discovery Science
 Book Subtitle
 Second International Conference, DS’99, Tokyo, Japan, December 6–8, 1999, Proceedings
 Pages
 pp 172–183
 Copyright
 1999
 DOI
 10.1007/3-540-46846-3_16
 Print ISBN
 978-3-540-66713-1
 Online ISBN
 978-3-540-46846-2
 Series Title
 Lecture Notes in Computer Science
 Series Volume
 1721
 Series ISSN
 0302-9743
 Publisher
 Springer Berlin Heidelberg
 Copyright Holder
 Springer-Verlag Berlin Heidelberg
 Editors

 Setsuo Arikawa ^{(1)}
 Koichi Furukawa ^{(2)}
 Editor Affiliations

 1. Department of Informatics, Kyushu University
 2. Graduate School of Media and Governance, Keio University
 Authors

 Carlos Domingo ^{(5)}
 Ricard Gavaldà ^{(6)}
 Osamu Watanabe ^{(5)}
 Author Affiliations

 5. Dept. of Math. and Comp. Science, Tokyo Institute of Technology, Tokyo, Japan
 6. Dept. of LSI, Universitat Politècnica de Catalunya, Barcelona, Spain