Abstract
We propose a fast and efficient sampling strategy for building decision trees from a very large database, even when there are many continuous attributes that must be discretized at each step. Successive samples are used, one at each tree node. After a brief description of two fast sequential simple random sampling methods, we apply elements of statistical theory to determine the sample size that is sufficient at each step to obtain a decision tree as efficient as one built on the whole database. Applying the method to a simulated database (of virtually infinite size) and to five standard benchmarks confirms that, when the database is large and contains many numerical attributes, our strategy of fast sampling at each node (with sample sizes of about n = 300 to 500) speeds up the mining process while maintaining the accuracy of the classifier.
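The "fast sequential simple random sampling" the abstract refers to can be illustrated by the classic one-pass selection-sampling scheme of Fan, Muller, and Rezucha (1962): each record is kept with probability (remaining quota) / (remaining records), which draws every size-n subset with equal probability while reading the data strictly in order. The sketch below is illustrative only; the function name and interface are hypothetical, not taken from the chapter.

```python
import random

def selection_sample(records, n, seed=None):
    """One-pass sequential simple random sample of n items.

    Classic selection sampling (Fan et al., 1962): when t records have
    been seen and m already selected, the next record is kept with
    probability (n - m) / (N - t). This gives each size-n subset equal
    probability and preserves the original record order.
    """
    rng = random.Random(seed)
    N = len(records)
    sample = []
    for t, rec in enumerate(records):
        remaining = N - t          # records not yet examined (incl. this one)
        needed = n - len(sample)   # slots still to fill
        if rng.random() * remaining < needed:
            sample.append(rec)
        if len(sample) == n:
            break                  # quota reached; rest of the scan is skipped
    return sample
```

Vitter's (1987) algorithm improves on this by generating random skip lengths directly, so the expected cost depends on the sample size n rather than the database size N, which matters when n is a few hundred and N is in the millions.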
© 2001 Springer Science+Business Media Dordrecht
Cite this chapter
Chauchat, JH., Rakotomalala, R. (2001). Sampling Strategy for Building Decision Trees from Very Large Databases Comprising Many Continuous Attributes. In: Liu, H., Motoda, H. (eds) Instance Selection and Construction for Data Mining. The Springer International Series in Engineering and Computer Science, vol 608. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3359-4_10
DOI: https://doi.org/10.1007/978-1-4757-3359-4_10
Print ISBN: 978-1-4419-4861-8
Online ISBN: 978-1-4757-3359-4