Abstract
Analyzing, predicting, and discovering previously unknown knowledge from a large data set requires an effective input dataset, a valid pattern-spotting capability, and sound evaluation of the discovered patterns. The criteria of significance, novelty, and usefulness must be fulfilled when evaluating the performance of prediction and classification. Data mining, an important step in this knowledge-discovery process, extracts hidden and non-trivial information from raw data through methods such as decision tree classification. However, owing to the enormous size, high dimensionality, and heterogeneous nature of many data sets, traditional decision tree classification algorithms sometimes perform poorly in terms of computation time. This paper proposes a framework that uses a parallel strategy to optimize the performance of decision tree induction and cross-validation for classifying data. Moreover, an existing pruning method is incorporated into the framework to overcome overfitting, enhance generalization ability, and reduce cost and structural complexity. Experiments on ten benchmark data sets show significant improvement in computation time and better classification accuracy with the optimized classification framework.
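The core step of the decision tree induction mentioned above is choosing the split that best separates the classes. The following is a minimal sketch of that idea using entropy-based information gain, the criterion behind classic algorithms such as ID3/C4.5; it is not the authors' parallel implementation, and the toy feature values and function names are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Gain from splitting a numeric feature at `threshold` (x <= t vs x > t)."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# Toy data: one numeric feature with binary class labels.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = ["a", "a", "a", "b", "b", "b"]

# Decision tree induction evaluates every candidate threshold and keeps the best.
best = max(information_gain(xs, ys, t) for t in xs[:-1])
print(best)  # 1.0: splitting at 3.0 separates the classes perfectly
```

A full induction algorithm applies this search recursively to each resulting partition; the parallel strategy in the paper distributes exactly this per-split evaluation work, which dominates computation time on large, high-dimensional data sets.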
© 2013 Springer-Verlag Berlin Heidelberg
Alam, F.I., Bappee, F.K., Rabbani, M.R., Islam, M.M. (2013). An Optimized Formulation of Decision Tree Classifier. In: Unnikrishnan, S., Surve, S., Bhoir, D. (eds) Advances in Computing, Communication, and Control. ICAC3 2013. Communications in Computer and Information Science, vol 361. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36321-4_10
Print ISBN: 978-3-642-36320-7
Online ISBN: 978-3-642-36321-4