Abstract
Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. On massive datasets, however, both techniques run into limits, because the size of the dataset itself becomes a bottleneck. Voting many classifiers, each built on a small subset of the data ("pasting small votes"), is a promising approach to learning from massive datasets: it can exploit the power of bagging and boosting while potentially scaling to data too large for either. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show that this approach is fast, accurate, and scalable to massive datasets.
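The core idea of pasting small votes can be sketched in a few lines: draw many small random subsets ("bites") of the training data, fit an independent classifier to each, and classify new points by plurality vote. The sketch below uses a toy nearest-centroid learner and synthetic data purely for illustration; the function names (`paste_rvotes`, `vote`) and the base learner are assumptions of this sketch, not the framework described in the paper, which builds decision trees on distributed subsets.

```python
import random
from collections import Counter

def centroid_fit(points, labels):
    """Toy base learner: store the per-class feature means."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def centroid_predict(model, x):
    """Predict the class whose centroid is nearest to x."""
    dist = lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x))
    return min(model, key=dist)

def paste_rvotes(points, labels, n_voters=25, bite_size=8, seed=0):
    """Train n_voters classifiers, each on a small random bite of the data."""
    rng = random.Random(seed)
    voters = []
    for _ in range(n_voters):
        idx = [rng.randrange(len(points)) for _ in range(bite_size)]
        voters.append(centroid_fit([points[i] for i in idx],
                                   [labels[i] for i in idx]))
    return voters

def vote(voters, x):
    """Combine the small classifiers by plurality vote."""
    return Counter(centroid_predict(m, x) for m in voters).most_common(1)[0][0]

# Two well-separated synthetic classes.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
y = [0, 0, 0, 1, 1, 1]
ensemble = paste_rvotes(data, y)
print(vote(ensemble, (0.1, 0.1)), vote(ensemble, (5.0, 5.0)))
```

Because each voter touches only a small bite, the training steps are independent and embarrassingly parallel, which is what makes the distributed variant attractive on massive datasets.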
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chawla, N.V., Hall, L.O., Bowyer, K.W., Moore, T.E., Kegelmeyer, W.P. (2002). Distributed Pasting of Small Votes. In: Roli, F., Kittler, J. (eds) Multiple Classifier Systems. MCS 2002. Lecture Notes in Computer Science, vol 2364. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45428-4_5
Print ISBN: 978-3-540-43818-2
Online ISBN: 978-3-540-45428-1