Abstract
The rapid advance of modern technologies produces vast amounts of data. Storing, analyzing, and mining knowledge from such data demands large storage space and fast execution, and training classifiers on large datasets is correspondingly expensive in both time and memory. To avoid this waste, significant information must first be mined from the raw collection. The decision tree is one promising classifier for extracting knowledge from large data. This paper aims to reduce the data used to construct an efficient decision tree classifier by identifying informative instances that improve its performance. Two clustering-based methods are proposed for dimensionality reduction and for exploiting the knowledge contained in outliers; the condensed data they produce are fed to the decision tree to obtain high prediction accuracy. The first method is distinctive in that it finds representative instances within clusters by using knowledge of their neighboring data. The second uses supervised clustering to determine the number of cluster representatives needed for data reduction. While increasing the prediction accuracy of the tree, both methods decrease its size and the time and space required to build it. These novel methods are united into a single supervised-and-unsupervised framework, Decision Tree based on Cluster Analysis Pre-processing (DTCAP), which hunts for informative instances in small, medium, and large datasets. Experiments on standard UCI datasets of different sizes show that the method, despite its simplicity, reduces the data by up to 50% and produces a qualitative dataset that enhances the performance of the decision tree classifier.
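The abstract describes selecting representative instances from clusters as a pre-processing step before training the classifier. The sketch below is not the authors' DTCAP algorithm; it only illustrates the generic idea of clustering-based instance selection, assuming numeric features: cluster the data with a minimal k-means, then keep the one instance nearest each cluster centroid as that cluster's representative.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each instance to its nearest centroid.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each centroid to the mean of its assigned instances.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def select_representatives(X, k):
    """Keep, per cluster, the index of the instance nearest the centroid."""
    labels, centers = kmeans(X, k)
    keep = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        if len(idx):
            dists = ((X[idx] - centers[j]) ** 2).sum(axis=1)
            keep.append(int(idx[np.argmin(dists)]))
    return sorted(keep)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))          # 200 instances, 4 features
kept = select_representatives(X, k=20) # condensed to at most 20 instances
print(len(kept))
```

The condensed set `X[kept]` (with the corresponding labels) would then be used to train the decision tree in place of the full dataset; the paper's methods additionally exploit neighborhood knowledge and outliers, which this sketch omits.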
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Panhalkar, A.R., Doye, D.D. An approach of improving decision tree classifier using condensed informative data. Decision 47, 431–445 (2020). https://doi.org/10.1007/s40622-020-00265-3