Abstract
The rapid advance of modern technologies produces vast amounts of data. Storing, analyzing, and mining knowledge from such data demands large storage space and fast execution, and training classifiers on large datasets is correspondingly expensive in both time and memory. To avoid this waste, significant information must first be mined from the raw collection. The decision tree is one promising classifier for extracting knowledge from large data. This paper aims to reduce the data used to construct an efficient decision tree classifier by identifying informative instances that improve its performance. Two clustering-based methods are proposed for dimensionality reduction and for exploiting the knowledge contained in outliers; the condensed data they produce are fed to the decision tree to obtain high prediction accuracy. The first method is distinctive in that it finds representative instances within clusters by using knowledge of their neighboring data. The second uses supervised clustering to determine the number of cluster representatives needed for data reduction. While increasing the prediction accuracy of the tree, both methods decrease its size and the time and space required to build it. These novel methods are united into a single supervised-and-unsupervised framework, Decision Tree based on Cluster Analysis Pre-processing (DTCAP), which hunts for informative instances in small, medium, and large datasets. Experiments on standard UCI datasets of different sizes show that the method, despite its simplicity, reduces the data by up to 50% and produces a qualitative dataset that enhances the performance of the decision tree classifier.
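The abstract describes selecting representative instances from clusters as a pre-processing step before training the classifier. The sketch below is not the authors' DTCAP algorithm; it only illustrates the generic idea of clustering-based instance selection, assuming numeric features: cluster the data with a minimal k-means, then keep the one instance nearest each cluster centroid as that cluster's representative.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each instance to its nearest centroid.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each centroid to the mean of its assigned instances.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def select_representatives(X, k):
    """Keep, per cluster, the index of the instance nearest the centroid."""
    labels, centers = kmeans(X, k)
    keep = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        if len(idx):
            dists = ((X[idx] - centers[j]) ** 2).sum(axis=1)
            keep.append(int(idx[np.argmin(dists)]))
    return sorted(keep)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))          # 200 instances, 4 features
kept = select_representatives(X, k=20) # condensed to at most 20 instances
print(len(kept))
```

The condensed set `X[kept]` (with the corresponding labels) would then be used to train the decision tree in place of the full dataset; the paper's methods additionally exploit neighborhood knowledge and outliers, which this sketch omits.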
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Panhalkar, A.R., Doye, D.D. An approach of improving decision tree classifier using condensed informative data. Decision 47, 431–445 (2020). https://doi.org/10.1007/s40622-020-00265-3