Classification Through Data Mining Algorithm

  • Conference paper
Computer Vision and Robotics

Part of the book series: Algorithms for Intelligent Systems ((AIS))

Abstract

Classification is a data mining technique in the machine learning domain. Algorithms such as K-nearest neighbors, support vector machines, random forests, logistic regression, and decision trees are used to solve classification problems. Of these, logistic regression and decision trees are perhaps the most widely used. This study compares the performance of these two techniques in classifying observations from two different datasets. It aims to identify the cases in which one algorithm is preferable to the other and to explore the advantages and disadvantages of each. After a data discovery phase, which included checking for missing values and removing redundant predictors, logistic regression and decision tree models were built for both datasets. The models were compared using their ROC curves, and their predictive ability was assessed from their confusion matrices.
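
The evaluation step described in the abstract (confusion matrices and ROC curves) can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the 0.5 threshold, and the labels and scores below are hypothetical and stand in for the predicted probabilities that two fitted classifiers might produce on a test set.

```python
from typing import List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_score: Sequence[float],
                     threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_points(y_true: Sequence[int],
               y_score: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Sweep the threshold from the highest score down to the lowest.
    for threshold in sorted(set(y_score), reverse=True):
        tp, fp, _, _ = confusion_counts(y_true, y_score, threshold)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical test labels and predicted probabilities from two models.
y_true = [1, 1, 0, 1, 0, 0]
scores_model_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
scores_model_b = [0.6, 0.4, 0.7, 0.3, 0.5, 0.2]

counts_a = confusion_counts(y_true, scores_model_a)
auc_a = auc(roc_points(y_true, scores_model_a))
auc_b = auc(roc_points(y_true, scores_model_b))
print("model A (tp, fp, fn, tn):", counts_a)
print("AUC A:", round(auc_a, 3), "AUC B:", round(auc_b, 3))
```

A model whose ROC curve lies closer to the top-left corner has a larger area under the curve, while the confusion matrix shows how its errors split between false positives and false negatives at one chosen threshold.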



Appendix

See Table 1.

Table 1 Complexity parameter (CP) table showing the cross-validation error (xerror) and the size of a tree obtained corresponding to different CP values with respect to the deep/parent tree
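
A CP table of this kind is commonly read by picking the row with the lowest cross-validation error and pruning at that row's CP value. The Python sketch below illustrates that reading under stated assumptions: every number in it is hypothetical and is not taken from Table 1, the `xstd` column and the one-standard-error rule are additions for illustration, and the source does not say how its own pruning value was chosen.

```python
from typing import List, NamedTuple


class CpRow(NamedTuple):
    cp: float       # complexity parameter
    nsplit: int     # number of splits, a proxy for tree size
    xerror: float   # cross-validation error relative to the root
    xstd: float     # standard error of xerror (assumed column)


def min_xerror_row(table: List[CpRow]) -> CpRow:
    """Row with the lowest cross-validation error."""
    return min(table, key=lambda row: row.xerror)


def one_se_row(table: List[CpRow]) -> CpRow:
    """Smallest tree whose xerror is within one standard error of the minimum."""
    best = min_xerror_row(table)
    limit = best.xerror + best.xstd
    candidates = [row for row in table if row.xerror <= limit]
    return min(candidates, key=lambda row: row.nsplit)


# Hypothetical CP table, NOT the values reported in the paper.
cp_table = [
    CpRow(0.5, 0, 1.000, 0.020),
    CpRow(0.1, 1, 0.520, 0.015),
    CpRow(0.02, 3, 0.300, 0.012),
    CpRow(0.005, 5, 0.250, 0.011),
    CpRow(0.001, 9, 0.245, 0.011),
    CpRow(0.00001, 20, 0.260, 0.011),
]

best_min = min_xerror_row(cp_table)
best_1se = one_se_row(cp_table)
print("min xerror:", best_min)
print("one-SE choice:", best_1se)
```

The one-standard-error choice trades a small amount of cross-validation error for a smaller, easier-to-read tree.
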

Fig. 4 Distribution of the edibility of a mushroom with respect to its cap color and cap surface

Fig. 5 Distribution of the edibility of a mushroom with respect to its cap shape and cap color

Fig. 6 Distribution of the edibility of a mushroom with respect to its odor

Fig. 7 Structure of the hotel dataset

Fig. 8 Distribution of hotel stays throughout the year

Fig. 9 Distribution of hotel stays with respect to the average daily rate of a hotel

Fig. 10 Decision tree obtained for the mushroom dataset

Fig. 11 Confusion matrix obtained after applying the decision tree to the training dataset

Fig. 12 Confusion matrix obtained after applying the decision tree to the testing dataset

Fig. 13 Unpruned tree obtained with CP = 0.00001 when applied to the training data of the mushroom dataset

Fig. 14 Pruned tree obtained by pruning the unpruned tree at a CP value of 0.0020747

Fig. 15 Confusion matrix obtained after applying the pruned tree to the training set of the mushroom dataset

Fig. 16 Confusion matrix obtained after applying the pruned tree to the testing set of the mushroom dataset

Fig. 17 ROC curve showing the performance of the decision tree model on the testing set of the mushroom dataset

Fig. 18 Confusion matrix obtained after applying the logistic regression model to the testing set of the hotel dataset

Fig. 19 Decision tree obtained for the hotel dataset

Fig. 20 Confusion matrix obtained after applying the decision tree model to the training set of the hotel dataset

Fig. 21 Confusion matrix obtained after applying the decision tree model to the testing set of the hotel dataset

Fig. 22 Confusion matrix obtained after applying the decision tree trained on the imbalanced training set to the testing set of the hotel dataset

Fig. 23 ROC curve showing the performance of logistic regression and decision tree on the testing set of the hotel dataset

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Roy, D., Chatterjee, S. (2022). Classification Through Data Mining Algorithm. In: Bansal, J.C., Engelbrecht, A., Shukla, P.K. (eds) Computer Vision and Robotics. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-16-8225-4_7
