Abstract

Training with too much data can incur substantial computational cost. Furthermore, creating, collecting, or procuring data may be expensive. Unfortunately, the minimum sufficient training-set size can seldom be known a priori. We describe and analyze several methods for progressive sampling: using progressively larger samples as long as model accuracy improves. We explore several notions of efficient progressive sampling, including both methods that are asymptotically optimal and methods that take into account prior expectations of appropriate data size. We then show empirically that progressive sampling can indeed be remarkably efficient.
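
To make the core loop concrete, the following is a minimal Python sketch of progressive sampling as described above. It assumes a geometric sampling schedule, a decision-tree learner, and a simple plateau test on held-out accuracy; the function name `progressive_sample`, the growth factor, and the tolerance are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def progressive_sample(X_train, y_train, X_val, y_val,
                       n0=100, growth=2.0, tol=1e-3, seed=0):
    """Train on progressively larger samples while validation accuracy improves.

    Sketch only: the geometric schedule (n0, 2*n0, 4*n0, ...), the decision-tree
    learner, and the plateau test with tolerance `tol` are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    n = min(int(n0), len(X_train))
    best_acc, best_model = float("-inf"), None
    while True:
        # Draw a random sample of the current size and train on it.
        idx = rng.choice(len(X_train), size=n, replace=False)
        model = DecisionTreeClassifier(random_state=seed)
        model.fit(X_train[idx], y_train[idx])
        acc = model.score(X_val, y_val)
        # Stop once accuracy no longer improves by more than `tol`.
        if acc <= best_acc + tol:
            return best_model
        best_acc, best_model = acc, model
        if n == len(X_train):
            return best_model  # entire training set already used
        n = min(int(n * growth), len(X_train))
```

The geometric schedule here reflects one of the notions of efficiency mentioned in the abstract; schedules informed by prior expectations of the appropriate data size are among the alternatives the chapter analyzes.
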




Copyright information

© 2001 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Provost, F., Jensen, D., Oates, T. (2001). Progressive Sampling. In: Liu, H., Motoda, H. (eds) Instance Selection and Construction for Data Mining. The Springer International Series in Engineering and Computer Science, vol 608. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3359-4_9

  • DOI: https://doi.org/10.1007/978-1-4757-3359-4_9

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-4861-8

  • Online ISBN: 978-1-4757-3359-4

  • eBook Packages: Springer Book Archive
