Application of classification methods to analyze chemicals in drinking water quality

  • Muhammad Azam
  • Asma Arshad
  • Muhammad Aslam (corresponding author)
  • Sadia Gulzar
General Paper


Various statistical methods, including discriminant analysis, logistic regression and cluster analysis, have been applied to drinking water datasets to construct models that identify important input variables. Among them, decision trees are more flexible than other statistical classification methods because they provide a complete path to a specific decision, which makes the critical variables simple to identify and easy to understand. This article describes the application of classification decision trees to the analysis of variables affecting drinking water quality, discusses the trees produced by several methods, and compares them to find the best approach for further analysis of the study area. In this study, samples of filtered water were taken from 100 pumps located in different union councils of the city of Lahore. The classification trees are constructed from the input quality variables, and the results are reported as confusion matrices. Four techniques were used: Chi-square Automatic Interaction Detector (CHAID), Exhaustive Chi-square Automatic Interaction Detector (ECHAID), Classification and Regression Tree (CRT) and Quick Unbiased Efficient Statistical Tree (QUEST). Three experiments were conducted to evaluate the performance of the models by the number of misclassified units. The first used the complete dataset, the second was based on cross-validation, and the third was based on random subsampling.
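The three evaluation experiments can be illustrated with a minimal Python sketch. It is not the authors' implementation: the classifier is a hypothetical one-split decision stump standing in for CHAID, ECHAID, CRT and QUEST, the data are synthetic, and the fold count (10), test fraction (0.3) and number of repeats (20) are assumptions that the abstract does not state. Only the evaluation logic (confusion matrix, count of misclassified units, and the three schemes) follows the description above.

```python
import random

def fit_stump(rows):
    """Fit a one-split tree: choose the (feature, threshold, labels) with fewest training errors."""
    best = None
    n_features = len(rows[0][0])
    for f in range(n_features):
        for x, _ in rows:
            t = x[f]
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if xi[f] <= t else right) != yi for xi, yi in rows)
                if best is None or errors < best[0]:
                    best = (errors, f, t, left, right)
    _, f, t, left, right = best
    return lambda x: left if x[f] <= t else right

def confusion_matrix(model, rows):
    """Return 2x2 counts indexed as m[actual][predicted]."""
    m = [[0, 0], [0, 0]]
    for x, y in rows:
        m[y][model(x)] += 1
    return m

def misclassified(m):
    """Off-diagonal cells of the confusion matrix are the misclassified units."""
    return m[0][1] + m[1][0]

def resubstitution(rows):
    """Experiment 1: fit and evaluate on the complete dataset."""
    return misclassified(confusion_matrix(fit_stump(rows), rows))

def cross_validation(rows, k, rng):
    """Experiment 2: k-fold cross-validation, total misclassified over all folds."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    errors = 0
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        errors += misclassified(confusion_matrix(fit_stump(train), folds[i]))
    return errors

def random_subsampling(rows, test_fraction, repeats, rng):
    """Experiment 3: repeated random train/test splits, mean misclassified per test set."""
    n_test = max(1, int(len(rows) * test_fraction))
    counts = []
    for _ in range(repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        counts.append(misclassified(confusion_matrix(fit_stump(train), test)))
    return sum(counts) / repeats

def make_synthetic(n, rng):
    """Synthetic stand-in for the water samples: two numeric inputs, binary class, 10% label noise."""
    rows = []
    for _ in range(n):
        x = (rng.uniform(0, 10), rng.uniform(0, 10))
        y = 1 if x[0] > 5 else 0
        if rng.random() < 0.1:
            y = 1 - y
        rows.append((x, y))
    return rows

rng = random.Random(0)
data = make_synthetic(100, rng)  # 100 units, mirroring the 100 pumps
print("complete dataset:", resubstitution(data))
print("10-fold cross-validation:", cross_validation(data, 10, rng))
print("random subsampling (mean):", random_subsampling(data, 0.3, 20, rng))
```

Evaluating on the complete dataset typically gives an optimistic error count because the model is tested on the units it was fitted to, which is one reason to compare it against the two held-out schemes.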


Classification trees, CHAID, ECHAID, CRT, QUEST, Cross-validation



The authors are deeply thankful to the editor and the reviewers for their valuable suggestions to improve the quality of this manuscript. This work was supported by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah. The author Muhammad Aslam therefore gratefully acknowledges the technical support of the DSR.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Statistics and Computer Science, University of Veterinary and Animal Sciences, Lahore, Pakistan
  2. National College of Business Administration and Economics, Lahore, Pakistan
  3. Department of Statistics, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
  4. Department of Statistics, Kinnaird College for Women, Lahore, Pakistan
