Abstract
We propose a fast and efficient sampling strategy for building decision trees from a very large database, even when there are many continuous attributes that must be discretized at each step. Successive samples are used, one at each tree node. After a brief description of two fast sequential simple random sampling methods, we apply elements of statistical theory to determine the sample size that is sufficient at each step to obtain a decision tree as efficient as one built on the whole database. Applying the method to a simulated database (of virtually infinite size) and to five standard benchmarks confirms that, when the database is large and contains many numerical attributes, our strategy of fast sampling at each node (with sample sizes of about n = 300 to 500) speeds up the mining process while maintaining the accuracy of the classifier.
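The "fast sequential simple random sampling" the abstract refers to can be illustrated by the classic one-pass selection-sampling scheme of Fan, Muller, and Rezucha (1962): each record is kept with probability (remaining quota) / (remaining records), which draws every size-n subset with equal probability while reading the data strictly in order. The sketch below is illustrative only; the function name and interface are hypothetical, not taken from the chapter.

```python
import random

def selection_sample(records, n, seed=None):
    """One-pass sequential simple random sample of n items.

    Classic selection sampling (Fan et al., 1962): when t records have
    been seen and m already selected, the next record is kept with
    probability (n - m) / (N - t). This gives each size-n subset equal
    probability and preserves the original record order.
    """
    rng = random.Random(seed)
    N = len(records)
    sample = []
    for t, rec in enumerate(records):
        remaining = N - t          # records not yet examined (incl. this one)
        needed = n - len(sample)   # slots still to fill
        if rng.random() * remaining < needed:
            sample.append(rec)
        if len(sample) == n:
            break                  # quota reached; rest of the scan is skipped
    return sample
```

Vitter's (1987) algorithm improves on this by generating random skip lengths directly, so the expected cost depends on the sample size n rather than the database size N, which matters when n is a few hundred and N is in the millions.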
© 2001 Springer Science+Business Media Dordrecht
Cite this chapter
Chauchat, JH., Rakotomalala, R. (2001). Sampling Strategy for Building Decision Trees from Very Large Databases Comprising Many Continuous Attributes. In: Liu, H., Motoda, H. (eds) Instance Selection and Construction for Data Mining. The Springer International Series in Engineering and Computer Science, vol 608. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3359-4_10
DOI: https://doi.org/10.1007/978-1-4757-3359-4_10
Print ISBN: 978-1-4419-4861-8
Online ISBN: 978-1-4757-3359-4