SLIQ: A fast scalable classifier for data mining
Classification is an important problem in the emerging field of data mining. Although classification has been studied extensively in the past, most classification algorithms are designed only for memory-resident data, which limits their suitability for mining large data sets. This paper discusses issues in building a scalable classifier and presents the design of SLIQ, a new classifier. SLIQ is a decision tree classifier that can handle both numeric and categorical attributes. It uses a novel pre-sorting technique in the tree-growth phase. This sorting procedure is integrated with a breadth-first tree-growing strategy to enable classification of disk-resident data sets. SLIQ also uses a new tree-pruning algorithm that is inexpensive and results in compact and accurate trees. The combination of these techniques enables SLIQ to scale to large data sets and to classify data sets irrespective of the number of classes, attributes, and examples (records), thus making it an attractive tool for data mining.
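The pre-sorting idea can be illustrated with a small sketch. It assumes that a numeric attribute is sorted once together with its record ids, and that candidate thresholds are then scored in a single scan of that sorted list. The Gini index is used as the split criterion here as an assumption, since the abstract does not name one. The names `gini`, `presort`, and `best_split` are hypothetical, and the sketch works in memory on one attribute at one node, whereas the abstract describes disk-resident data and breadth-first growth.

```python
from collections import Counter


def gini(counts, total):
    """Gini impurity of a mapping from class label to count."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def presort(values):
    """Sort one numeric attribute once, keeping each record id with its value."""
    return sorted((v, rid) for rid, v in enumerate(values))


def best_split(attr_list, labels):
    """Scan a pre-sorted attribute list once.

    Returns (threshold, weighted_gini) for the best binary split
    of the form `value <= threshold`, or (None, inf) if no split exists.
    """
    n = len(attr_list)
    left = Counter()                                      # classes at or below the threshold
    right = Counter(labels[rid] for _, rid in attr_list)  # classes above the threshold
    best = (None, float("inf"))
    for i in range(n - 1):
        value, rid = attr_list[i]
        # Move the current record from the right partition to the left one.
        left[labels[rid]] += 1
        right[labels[rid]] -= 1
        next_value = attr_list[i + 1][0]
        if value == next_value:
            continue  # no threshold fits between equal values
        n_left = i + 1
        n_right = n - n_left
        score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        if score < best[1]:
            best = ((value + next_value) / 2, score)
    return best


# Illustrative data only (not from the paper).
values = [10.0, 1.0, 12.0, 2.0, 3.0, 11.0]
labels = ["b", "a", "b", "a", "a", "b"]
attr_list = presort(values)
threshold, score = best_split(attr_list, labels)
```

Because the sort happens once up front, the sorted list can be reused at every level of the tree instead of re-sorting the records at each node, which is the cost the pre-sorting step is meant to avoid.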