
Sampling Strategy for Building Decision Trees from Very Large Databases Comprising Many Continuous Attributes

  • Chapter
Instance Selection and Construction for Data Mining

Abstract

We propose a fast and efficient sampling strategy for building decision trees from very large databases, even when many continuous attributes must be discretized at each step. Successive samples are used, one at each tree node. After a brief description of two fast sequential simple random sampling methods, we apply elements of statistical theory to determine the sample size that is sufficient at each step to obtain a decision tree as accurate as one built on the whole database. Applying the method to a simulated database (of virtually infinite size) and to five standard benchmarks confirms that, when the database is large and contains many numerical attributes, our strategy of fast sampling at each node (with sample sizes of about n = 300 to 500) speeds up the mining process while maintaining the accuracy of the classifier.
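The two fast sequential simple random sampling methods mentioned above are not reproduced on this page. As a hedged illustration only, the sketch below implements one classical one-pass routine from that family, selection sampling in the style of Fan, Muller and Rezucha (1962): each record is kept with probability (still needed) / (still unseen), which yields exactly n records in a single sequential scan. The function name and interface are our own, not the authors'.

```python
import random

def sequential_sample(records, n, rng=random):
    """Draw a simple random sample of exactly n records in one sequential
    pass (classical selection sampling). Assumes n <= len(records).

    Each record is retained with probability (n - selected) / (N - seen),
    which guarantees exactly n selections and preserves record order.
    """
    N = len(records)
    sample, selected = [], 0
    for seen, rec in enumerate(records):
        # keep rec with probability (n - selected) / (N - seen)
        if rng.random() * (N - seen) < (n - selected):
            sample.append(rec)
            selected += 1
            if selected == n:
                break  # sample complete; stop scanning early
    return sample
```

Because the scan is strictly sequential and stops once the sample is full, a routine of this kind can draw a fresh sample of the records reaching each tree node without random access to the database, which is what makes per-node sampling cheap.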
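The chapter's own statistical derivation of the sufficient per-node sample size is not shown on this page. Purely as a generic illustration of the kind of calculation involved (an assumption of ours, not the authors' method), the familiar normal-approximation formula for estimating a proportion to within a given margin produces sample sizes of the same order as the n = 300 to 500 quoted in the abstract:

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(margin, confidence=0.95, p=0.5):
    """Classical sample size for estimating a proportion p to within
    +/- margin at the given confidence level: n = z^2 * p * (1 - p) / margin^2,
    where z is the two-sided normal quantile for the confidence level.
    p = 0.5 is the worst case (largest variance)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z * z * p * (1 - p) / margin ** 2)
```

For example, at 95% confidence a margin of about ±5.7% on a class proportion corresponds to roughly n = 300, and ±5% to about n = 385, which is consistent in magnitude with the sample sizes reported above.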

Copyright information

© 2001 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Chauchat, JH., Rakotomalala, R. (2001). Sampling Strategy for Building Decision Trees from Very Large Databases Comprising Many Continuous Attributes. In: Liu, H., Motoda, H. (eds) Instance Selection and Construction for Data Mining. The Springer International Series in Engineering and Computer Science, vol 608. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3359-4_10

  • Print ISBN: 978-1-4419-4861-8

  • Online ISBN: 978-1-4757-3359-4
