
An approach of improving decision tree classifier using condensed informative data

  • Research Article
  • Published in: Decision

Abstract

Modern technologies produce vast amounts of data. Storing, analyzing and mining knowledge from such data demands large storage space and fast execution, and training classifiers on large datasets is correspondingly expensive in both time and memory. To avoid this cost, significant information must be extracted from the full data collection before training. The decision tree is one of the promising classifiers for mining knowledge from large data. This paper aims to reduce the training data so that an efficient decision tree classifier can be constructed, and presents a method that finds informative data to improve decision tree performance. Two clustering-based methods are proposed, one for data reduction and one for exploiting knowledge from outliers; the condensed data are then supplied to the decision tree to obtain high prediction accuracy. The first method is distinctive in that it selects representative instances from clusters using knowledge of their neighboring data. The second method uses supervised clustering to determine the number of cluster representatives needed for data reduction. Along with increasing the prediction accuracy of the tree, these methods decrease the size, building time and space requirements of decision tree classifiers. The two methods are united into a single supervised and unsupervised Decision Tree based on Cluster Analysis Pre-processing (DTCAP) approach, which hunts for informative instances in small, medium and large datasets. Experiments conducted on standard UCI datasets of different sizes show that the method, despite its simplicity, reduces the data by up to 50% and produces a qualitative dataset that enhances the performance of the decision tree classifier.
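The cluster-based condensation idea described above can be sketched in a few lines. The following is a minimal illustration of the general technique, not the authors' DTCAP algorithm: within each class, a few k-means iterations group the instances, and only the real instance nearest each centroid is kept as that cluster's representative. The function name and parameters (`condense_by_clusters`, `k_per_class`) are hypothetical, chosen for this sketch.

```python
import numpy as np

def condense_by_clusters(X, y, k_per_class=2, iters=10, seed=0):
    """Per class: run a few k-means iterations, then keep only the real
    instance closest to each centroid as the cluster representative."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        pts = X[idx]
        k = min(k_per_class, len(pts))
        centroids = pts[rng.choice(len(pts), size=k, replace=False)].copy()
        for _ in range(iters):
            # assign each point to its nearest centroid, then recompute means
            dist = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
            assign = dist.argmin(axis=1)
            for j in range(k):
                members = pts[assign == j]
                if len(members):
                    centroids[j] = members.mean(axis=0)
        # representative = the actual instance nearest each final centroid
        dist = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        keep.extend(idx[np.unique(dist.argmin(axis=0))])
    keep = np.array(sorted(keep))
    return X[keep], y[keep]

# toy demo: tight clusters collapse to a few representative instances
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [9.0, 0.0], [9.1, 0.0], [9.0, 0.1]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
X_red, y_red = condense_by_clusters(X, y, k_per_class=2)
```

The reduced set `X_red` contains only instances that already existed in `X`, so a decision tree trained on it sees genuine (not synthetic) data points, which is the property the paper's instance-selection methods preserve.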


Figures 1–6: available in the full text of the article.



Author information

Correspondence to Archana R. Panhalkar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Panhalkar, A.R., Doye, D.D. An approach of improving decision tree classifier using condensed informative data. Decision 47, 431–445 (2020). https://doi.org/10.1007/s40622-020-00265-3
