Pasting Small Votes for Classification in Large Databases and On-Line

Abstract

Many databases have grown to the point where they cannot fit into the fast memory of even large-memory machines, let alone current workstations. If we want to use these databases to construct predictors of various characteristics, then, since the usual methods require that all of the data be held in fast memory, various work-arounds are needed. This paper studies one such class of methods, which are computationally fast and give accuracy comparable to what could have been obtained had all of the data been held in core. The procedure takes small pieces of the data, grows a predictor on each small piece, and then pastes these predictors together. A version is given that scales up to terabyte data sets. The methods are also applicable to on-line learning.
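The procedure sketched in the abstract lends itself to a short illustration. The code below is only a minimal sketch of the basic idea, not the paper's own algorithm: draw small random "bites" of the training data, grow a tree on each bite, and paste the trees together by plurality vote. The names paste_small_votes, n_votes, and bite_size, and the use of scikit-learn's DecisionTreeClassifier as the base predictor, are illustrative assumptions; the paper's more refined bite-selection strategies and its on-line variant are not shown.

    # Minimal sketch (assumed interface, not the paper's implementation):
    # grow one tree per small random "bite" of the data, then combine the
    # trees by unweighted plurality vote. X and y are assumed to be NumPy
    # arrays with non-negative integer class labels.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier  # illustrative base predictor

    def paste_small_votes(X, y, n_votes=100, bite_size=800, seed=0):
        """Fit one tree per small random piece of the data."""
        rng = np.random.default_rng(seed)
        n = len(X)
        trees = []
        for _ in range(n_votes):
            bite = rng.choice(n, size=min(bite_size, n), replace=False)  # small piece
            trees.append(DecisionTreeClassifier().fit(X[bite], y[bite]))
        return trees

    def vote(trees, X):
        """Paste the predictors together: plurality vote over their predictions."""
        preds = np.stack([t.predict(X) for t in trees])  # shape (n_votes, n_samples)
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, preds)

For comparison, scikit-learn's BaggingClassifier offers a related scheme (sampling without replacement via bootstrap=False together with a small max_samples), which its documentation refers to as pasting; whether its defaults correspond to the bite sizes and vote counts studied in the paper is not assumed here.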

References

  • Breiman, L. (1996). Out-of-bag estimation, available at ftp.stat.berkeley.edu/pub/users/breiman/OOBestimation.ps.

  • Breiman, L. (1998). Arcing classifiers. Annals of Statistics, 26, 801–824.

  • Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Wadsworth.

  • Breiman, L., & Spector, P. (1994). Parallelizing CART using a workstation network. Proceedings of the Annual American Statistical Association Meeting, San Francisco, available at ftp.stat.berkeley.edu/usrs/breiman/pcart.ps.Z.

  • Chan, P., & Stolfo, S. (1997a). Scalability of hierarchical meta-learning on partitioned data, submitted to the Journal of Data Mining and Knowledge Discovery, available at www.cs.fit.edu/~pkc/papers/dmkd-scale.ps.

  • Chan, P., & Stolfo, S. (1997b). On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information Systems, 9, 5–28.

  • Drucker, H., & Cortes, C. (1996). Boosting decision trees. Advances in Neural Information Processing Systems 8 (pp. 479–485). Morgan Kaufmann.

  • Freund, Y., & Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. http://www.research.att.com/orgs/ssr/people/yoav or http://www.research.att.com/orgs/ssr/people/schapire.

  • Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference (pp. 148–156).

  • Michie, D., Spiegelhalter, D., & Taylor, C. (1994). Machine learning, neural and statistical classification. London: Ellis Horwood.

  • Provost, F.J., & Hennessey, D. (1996). Scaling up: Distributed machine learning with cooperation. Proceedings AAAI-96.

  • Provost, F.J., & Kolluri, V. (1997a). A survey of methods for scaling up inductive learning algorithms. Accepted by Data Mining and Knowledge Discovery Journal: Special Issue on Scalable High-Performance Computing for KDD, available at www.pitt.edu/~uxkst/survey-paper.ps.

  • Provost, F.J., & Kolluri, V. (1997b). Scaling up inductive algorithms: An overview. Proceedings of Knowledge Discovery in Databases (KDD'97, pp. 239–242).

  • Quinlan, J.R. (1996). Bagging, boosting, and C4.5. Proceedings of the AAAI'96 National Conference (Vol. 1, pp. 725–730).

  • Schapire, R.E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–226.

  • Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. Proceedings of the 22nd VLDB Conference (pp. 544–555).

  • Tibshirani, R. (1996). Bias, variance, and prediction error for classification rules (Technical Report). Statistics Department, University of Toronto.

  • Utgoff, P. (1989). Incremental induction of decision trees. Machine Learning, 4, 161–186.

  • Wolpert, D.H., & Macready, W.G. (1996). An efficient method to estimate bagging's generalization error. Machine Learning, to appear.

About this article

Cite this article

Breiman, L. Pasting Small Votes for Classification in Large Databases and On-Line. Machine Learning 36, 85–103 (1999). https://doi.org/10.1023/A:1007563306331

Keywords

  • combining
  • database
  • votes
  • pasting