Decision Tree-Based Data Mining and Rule Induction for Identifying High Quality Groundwater Zones to Water Supply Management: a Novel Hybrid Use of Data Mining and GIS

Groundwater is an important source of drinking water in both arid and semi-arid regions. Nevertheless, locating high quality drinking water is a major challenge in such areas. Against this background, this study applies and compares five decision tree-based data mining algorithms, namely Ordinary Decision Tree (ODT), Random Forest (RF), Random Tree (RT), Chi-square Automatic Interaction Detector (CHAID), and Iterative Dichotomiser 3 (ID3), for rule induction in order to identify high quality groundwater zones for drinking purposes. The proposed methodology first extracts the key variables affecting water quality (electrical conductivity, pH, hardness and chloride) from a total of eight available parameters and uses them as inputs for the rule induction process. The algorithms were evaluated on both continuous and discrete datasets. The findings indicated that rule induction performed better on the continuous dataset than on the discrete dataset. Based on the validation results for the continuous dataset, RF and ODT showed higher performance and RT showed acceptable performance. The groundwater quality maps were generated in a GIS environment by combining the distribution maps of the effective parameters using the rules inducted from RF, ODT, and RT. The generated maps reveal a decline in groundwater quality from south to north as well as from east to west in the study area. RF showed the highest performance (accuracy of 97.10%) among the tested algorithms, so the map generated from the RF rules is the most reliable. The RF and ODT methods are more suitable for continuous datasets and can be applied for rule induction to determine water quality with higher accuracy than the other tested algorithms.


Introduction
Groundwater is an important source of drinking water supply in arid and semi-arid areas, which generally encounter water shortages arising from climate change. Owing to climate change and alterations in global precipitation patterns, the number of drought years has increased in many countries located in arid and semi-arid regions of the world. Under such circumstances, permanent rivers turn into seasonal rivers (Zarghami et al. 2011) and groundwater becomes the main source for meeting water demands, especially for drinking purposes.
Determining groundwater quality and locating areas with high quality water for drinking purposes is the principal challenge in this regard. A variety of standards for determining groundwater quality have been put forth by the World Health Organization (WHO) and/or relevant organizations in different countries. Water quality standards in Iran are specified by the Institute of Standards and Industrial Research of Iran (ISIR). WHO and ISIR standards specify only maximum contamination levels and admissible limits of water suitable for drinking purposes. Consequently, water samples whose parameters all lie within the standard limits are not necessarily of the same quality. Furthermore, each parameter may vary over a wide range before reaching its admissible limit.
Determining water quality from a few parameters, rather than from many quality parameters (variables) with different ranges and units, is one possible and rather interesting approach; it is, however, difficult and requires a decision support system (DSS). Accordingly, combining these parameters and classifying water quality is an important task in which a DSS plays a critical role. A DSS assists decision makers in optimizing their decisions (Turban 1993). Moreover, Geographic Information System (GIS) is an effective tool owing to its spatial decision support role and its function in developing GIS-based spatial DSS (SDSS) (Jeihouni et al. 2015).
Several studies have thus far been conducted with the objective of assessing water quality parameters and specifying their spatial distribution using GIS, e.g. D'Agostino et al. (1998), Hudak (2000, 2001), Gaus et al. (2003), Hudak and Sanmanee (2003), Yimit et al. (2011), Arslan (2012), and Bhunia et al. (2018). These studies have focused solely on the distribution maps of water quality parameters and have disregarded the combination of the layers. Nas and Berktay (2010) generated a water quality map for drinking purposes by a simple overlay of thematic maps of pH, electrical conductivity (EC), chloride, sulfate, hardness, and nitrate. Jeihouni et al. (2015) developed a Multiple-criteria Spatial Decision Support System (MC-SDSS) in which they used pH, EC, chloride, sulfate, and hardness as water quality parameters and employed the Analytical Hierarchy Process (AHP) as a multiple-criteria decision-making (MCDM) technique to generate a water quality map based on the importance of each parameter. In this approach, the importance of each parameter was determined by experts, and the significance of each parameter range, when within the permissible limit, was covered by the weight of the parameter.
Data mining and knowledge discovery in databases (KDD) are suitable approaches to extracting patterns of interest from databases (Fayyad et al. 1996). KDD is the automated extraction of valid, novel, understandable and useful patterns representing knowledge in databases (Rokach and Maimon 2005; Han et al. 2011), and data mining is the core of the KDD process (Rokach and Maimon 2005). Data mining classifier algorithms have been developed in recent decades and have been broadly used for classification, modeling and rule induction (Peters et al. 2007; Taghizadeh-Mehrjardi et al. 2015; Rodriguez-Galiano et al. 2012; Yoo et al. 2016; Rahmati et al. 2016; Naghibi et al. 2017; Sahoo et al. 2018; Arabameri et al. 2019; Chen et al. 2019; Shahbazi et al. 2019; Miraki et al. 2019; Sherafatpour et al. 2019).
Data mining decision tree algorithms aim to develop a classification graph and predict the target class based on the input training dataset. These algorithms are able to learn the relationships between input variables and corresponding outputs, and represent each relationship by specific rules. There are several data mining algorithms based on tree induction that are utilized for classification. Some of the more broadly used tree induction algorithms are Ordinary Decision Tree (ODT), Random Tree (RT), Random Forest (RF), Iterative Dichotomiser 3 (ID3), and Chi-square Automatic Interaction Detector (CHAID). These decision trees are generated through recursive partitioning. A summary of the tree-based data mining algorithms is presented in Table 1. These tree-based algorithms are frequently used in many fields and for different applications (Kim et al. 2011; Rodriguez-Galiano et al. 2012; Rahmati et al. 2016; Yoo et al. 2016; Belgiu and Drăguţ 2016; Hong et al. 2016; Naghibi et al. 2017; Heil et al. 2017; Chen et al. 2017; Robinson et al. 2018; Rayaroth and Sivaradje 2019; Al-Juboori 2019). For specific detail on the theoretical bases of data mining, its algorithms, and applications, the reader is referred to the works of Rokach and Maimon (2005), Han et al. (2011), Liao et al. (2012), and Rokach and Maimon (2014).
Data mining and rule induction techniques are able to extract rules from data and predict previously unknown events (Yoo et al. 2016). Decision tree-based techniques have a high capability for rule induction and for extracting relationships between variables in order to categorize them into meaningful classes. Given the similarity between determining water quality from several parameters and classification, data mining classification algorithms and rule induction techniques can be used to generate water quality maps. This capability of data mining and rule induction in the classification of water quality has not yet been addressed in any research. It is tested here, for the first time, on groundwater quality mapping.
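As an illustration of how recursive partitioning chooses a split, the following minimal sketch computes the information gain criterion commonly associated with ID3 on nominal attributes. The toy samples, the class names and the helper names (`entropy`, `information_gain`, `best_split`) are assumptions made for this example and are not taken from the study's dataset or software.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one nominal attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(rows, labels, attributes):
    """Attribute with the largest information gain (the greedy choice)."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy samples: nominal classes of two parameters and a quality label.
rows = [
    {"EC": "low", "pH": "normal"},
    {"EC": "low", "pH": "high"},
    {"EC": "high", "pH": "normal"},
    {"EC": "high", "pH": "high"},
]
labels = ["good", "good", "poor", "poor"]

chosen = best_split(rows, labels, ["EC", "pH"])
print(chosen)
```

In this toy case the EC classes separate the two labels completely, so EC is selected at the root; a full tree builder would repeat the same choice on each resulting subset until the subsets are pure or no attributes remain.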
The prime objectives of this study include: (1) specification of key water quality parameters influencing the determination of water quality, (2) evaluation of the capability of decision tree based rule induction methods such as ODT, RF, RT, ID3 and CHAID in identifying high quality groundwater zones for drinking purposes, and (3) generation of groundwater quality maps in Tabriz City and its township based on key parameters and extracted rules from data mining techniques. This paper proposes a novel approach for water quality mapping by hybrid use of data mining and GIS. The innovation of the study lies in inducing rules from database to combine the layers of the effective water quality parameters to generate the groundwater quality map.

Study Area
The study area is Tabriz City, the capital of East Azerbaijan province, and its township in northwestern Iran (Fig. 1). The area is located between latitudes 37°56′ and 38°11′ N and longitudes 46°2′ and 46°36′ E, with an approximate area of 1200 km². Tabriz is located in a semi-arid region and has a large population of circa 1.5 million. Tabriz water demands are mainly supplied from two dams located at distances of 30 and 200 km from Tabriz, with smaller contributions from certain wells located at a distance of 25 km from Tabriz (Jeihouni et al. 2015). Recent drought years have threatened the water supply from permanent rivers for Tabriz. The importance of locating high quality groundwater resources near Tabriz has been thoroughly discussed by Jeihouni et al. (2015).
Table 1 Summary of tree-based data mining algorithms (partial; header row not recoverable)
RT: The training sample set is projected to subsets, and for each split the algorithm uses only a random subset of attributes to generate a tree.
RF (* *, Nominal): The RF creates a set of random trees by repeating the RT process.
ID3 (*, Nominal): ID3 is a very simple decision tree algorithm that generates a tree based on a fixed sample set by using a greedy search; the tree model is generated without pruning.
CHAID (*, Nominal): Employs a chi-squared based criterion instead of the gain ratio or information gain; the algorithm generates the tree model without pruning. It is a non-parametric algorithm and uses frequencies instead of mean and variance.

Data and Data Preprocessing
Groundwater quality variables (sulfate, hardness, total anions, pH, calcium, magnesium, chloride, and EC) were measured at 80 observation wells in the Tabriz urban area and its township (Fig. 1). The dataset was collected by the Iranian Ministry of Energy (IMOE). Then, based on the laboratory results for each parameter, water quality experts were asked to rank each water sample from class A to I (A being the optimum quality class and I the poorest quality class). The final dataset was then prepared for further data mining analysis. In this study, the effective water quality parameters were selected during the pattern evaluation phase of the KDD process. In this phase, the RF algorithm was used to discover patterns and select the most effective parameters based on the extracted patterns.

Methodology
A three-step method was proposed to generate water quality maps. The first step consisted of basic preprocessing, such as creation of a spatial database, data cleaning and extraction of water quality variables. The data were then moved to an operational repository called an "operational data store" for further processing, integration and additional operations to verify data quality, prior to implementing the data warehouse (DW), the core and central repository for integrated data. The DW stores data applicable for decision making and creating analytical reports; from it, a specialized DW called a data mart is generated based on the required datasets and parameters before data mining pattern recognition techniques are applied. The RF algorithm is then used to extract patterns and discover the relevant parameters that affect water quality. The second step uses the most effective parameters as inputs to discover the relationships between parameters and water quality classes in both full range/continuous (numerical) and classified/discrete (nominal) datasets by the ODT, RT, RF, ID3 and CHAID methods, in order to induct rules that determine the quality of groundwater for drinking purposes. In the final step, to generate water quality maps, the rules inducted from the three high accuracy models were used to combine the spatial distribution (thematic) maps of the effective parameters, which were generated using ordinary kriging (OK) as a geostatistical technique. The brief methodology and schema of the proposed method are shown in Figs. 2 and 3.

Results and Discussion
The statistical summary of the groundwater quality variables (sulfate, hardness, total anions, pH, calcium, magnesium, chloride and EC) is shown in Table 2, with the corresponding correlation matrix map presented in Fig. 4. The effective parameters for water quality, as determined using data mining processes and pattern extraction, were hardness, pH, chloride and EC, all of which are highlighted in Table 2. These factors were subsequently used as inputs to generate decision trees and perform rule induction. The tree generation algorithms were evaluated on two dataset types, numerical and nominal, so as to assess their rule induction capabilities. In the continuous case (numerical dataset), the numerical data were used and trees were generated with the ODT, RT and RF methods. The inducted rules were assessed based on statistical criteria such as accuracy, classification error, Kappa coefficient, Spearman rho, Kendall tau and correlation. The validation results are shown in Table 3.
As evident in Table 3, the prediction capability of the RF method was higher than that of the ODT and RT methods based on the model evaluation criteria. RF had the highest accuracy (97.10%), the lowest classification error (2.9%), and the highest kappa, Spearman rho, Kendall tau and correlation values (0.962, 0.996, 0.987 and 0.983, respectively) among the methods, which demonstrates the superiority of this model. Moreover, the accuracy of ODT was higher than that of RT based on the validation criteria.
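To make three of the validation criteria concrete, the sketch below computes accuracy, classification error and an unweighted Cohen's kappa from expert-assigned and predicted classes. The sample labels are invented for illustration, and it is an assumption that the Kappa coefficient reported in the study is the unweighted form.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of samples whose predicted class equals the actual class."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def cohen_kappa(actual, predicted):
    """Unweighted Cohen's kappa: agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement equals 1, i.e. when
    both label lists contain a single identical class.
    """
    n = len(actual)
    observed = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement from the marginal class frequencies.
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels: expert-assigned classes versus classes predicted by rules.
actual = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "A", "B", "C", "C", "C"]

acc = accuracy(actual, predicted)
err = 1 - acc  # classification error is the complement of accuracy
kappa = cohen_kappa(actual, predicted)
print(f"accuracy={acc:.3f} error={err:.3f} kappa={kappa:.3f}")
```

Kappa is lower than accuracy here because part of the observed agreement would be expected by chance alone, which is why it is a useful complement to accuracy when classes are unevenly represented.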
In the next step, to generate the decision trees and rule induction for nominal dataset, each effective parameter was classified into three classes based on Ducci (1999) and Jeihouni et al. (2015). The classification thresholds are shown in Table 4. The decision trees were then generated by employing ODT, RT, RF, ID3, and CHAID methods. The rule induction accuracy was assessed based on statistical criteria. Table 5 lists the validation results.
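The conversion of a numerical parameter into three nominal classes can be sketched as follows. The cut points and class names below are placeholders chosen only for illustration; the thresholds actually used in the study are those listed in its Table 4 and are not reproduced here.

```python
from bisect import bisect_right

# Hypothetical cut points (lower, upper) per parameter; NOT the study's Table 4.
THRESHOLDS = {
    "EC": (750.0, 1500.0),
    "chloride": (250.0, 400.0),
}
CLASS_NAMES = ("low", "medium", "high")


def discretize(parameter, value):
    """Map a numerical value to one of three nominal classes.

    Values below the lower cut point fall in the first class, values from the
    lower cut point up to (but excluding) the upper one in the second class,
    and all remaining values in the third class.
    """
    cuts = THRESHOLDS[parameter]
    return CLASS_NAMES[bisect_right(cuts, value)]


# Two clearly different EC values collapse into the same nominal class.
print(discretize("EC", 800.0), discretize("EC", 1450.0))
```

Both example values are labeled "medium" although they differ considerably, which shows the kind of information that a nominal dataset discards before the tree algorithms see it.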
As indicated by Table 5, the accuracies of the rules inducted from the nominal dataset were very poor in comparison with the numerical dataset. The maximum Kappa coefficient (0.617) was observed for the ID3 method. ID3 also had the highest accuracy on the nominal dataset owing to its compatibility with nominal data. Despite ID3's suitability for rule induction from nominal datasets, this method failed to attain a high accuracy.
The results in Tables 3 and 5 indicate the superiority of numerical decision tree generation algorithms and rule induction methods (ODT, RT, and RF), in terms of accuracy and ability in the case of the continuous datasets. Accordingly, the numerical approach was more suitable for handling, mining and inducing rules for water quality classification.
Based on the results presented in Tables 3 and 5, and given the high performance of the applied algorithms on the numerical dataset, the water quality decision trees were generated only for the ODT, RT, and RF models. An instance of a classification tree generated by the ODT technique is shown in Fig. 5. The generated trees and inducted IF-THEN rules were applied to the distribution maps of the water quality parameters to generate the final water quality maps based on the ODT, RT and RF methods. To generate the spatial distribution map of each effective parameter, the geostatistical OK approach was employed.
The implementation of the OK method involves an initial check of the spatial autocorrelation of the effective parameters through Moran's I (index). Figure 6 illustrates a graphical representation of the Moran's I report. As can be observed in Fig. 6, all parameters exhibit clustered patterns and can therefore be modeled by OK. Because OK performs best on normally distributed datasets (Jeihouni et al. 2018), the normality of the effective parameter datasets was assessed through their histograms and the statistical criteria of skewness and kurtosis, which are presented in Table 2. pH had a normal distribution, whereas the EC, hardness and chloride datasets were not normally distributed. For this reason, the non-normally distributed datasets were normalized by logarithmic transformation (Table 2).
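The relation between a generated tree and its IF-THEN rules can be sketched by walking each root-to-leaf path. The nested-dictionary tree layout, the thresholds and the class labels below are hypothetical and do not reproduce the tree of Fig. 5.

```python
def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per leaf of a binary decision tree.

    A leaf is a dict with a "label" key; an internal node has "feature",
    "threshold", "left" (feature <= threshold) and "right" (feature > threshold).
    """
    if "label" in node:
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN class = {node['label']}"]
    feature, threshold = node["feature"], node["threshold"]
    left = extract_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
    right = extract_rules(node["right"], conditions + (f"{feature} > {threshold}",))
    return left + right


# Hypothetical tree with invented thresholds and class labels.
tree = {
    "feature": "EC",
    "threshold": 800,
    "left": {"label": "A"},
    "right": {
        "feature": "chloride",
        "threshold": 400,
        "left": {"label": "E"},
        "right": {"label": "I"},
    },
}

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Each printed rule corresponds to one leaf, so the rule set is exhaustive and mutually exclusive over the input space, which is what allows it to be applied cell by cell to raster layers.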
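For readers unfamiliar with the index, a minimal sketch of global Moran's I is given below. It assumes inverse-distance spatial weights and distinct sample locations; the weighting scheme used to produce Fig. 6 is not stated in the text, and the coordinates and values are invented.

```python
import math


def morans_i(points, values):
    """Global Moran's I with inverse-distance weights (no self-weights).

    Assumes all points are distinct, since coincident points would give a
    zero distance and an undefined weight.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    weighted_cross_products = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = 1.0 / math.dist(points[i], points[j])
            weight_sum += w
            weighted_cross_products += w * deviations[i] * deviations[j]
    variance_term = sum(d * d for d in deviations)
    return (n / weight_sum) * (weighted_cross_products / variance_term)


# Hypothetical wells: two tight pairs located far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
clustered = [1.0, 1.0, 5.0, 5.0]   # similar values sit next to each other
dispersed = [1.0, 5.0, 1.0, 5.0]   # neighbouring values differ

print(morans_i(points, clustered), morans_i(points, dispersed))
```

A positive index, as obtained for the clustered values, indicates that nearby wells tend to have similar values; the dispersed arrangement yields a negative index. In practice the index is judged against its expected value under randomness with a significance test rather than by sign alone.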
The basis of the OK method is to find the best-fitting variogram in order to obtain an accurate estimate of each effective factor at unsampled locations. Accordingly, 11 commonly used semi-variogram models (Circular, Spherical, Tetraspherical, Pentaspherical, Exponential, Gaussian, Rational Quadratic, Hole effect, K-Bessel, J-Bessel, and Stable) were tested for the hardness, pH, chloride, and EC datasets. The fitted semi-variograms for all parameters are presented in Fig. 7 and their corresponding model parameters are summarized in Table 6. The minimum and maximum ranges, which indicate the highest and the lowest spatial variability (Jeihouni et al. 2018), belonged to hardness and EC, respectively. Moreover, the nugget-to-sill ratio (%) of each factor was evaluated as a criterion for classifying spatial dependence (Caro et al. 2013). A ratio less than 25% indicates strong spatial dependence, a ratio between 25 and 75% indicates moderate spatial dependence, and a ratio greater than 75% indicates weak spatial dependence of the variable (Cambardella et al. 1994). According to this criterion, EC, hardness and chloride have strong spatial dependence, while pH has moderate spatial dependence. Spatial distribution maps of the effective factors were then generated based on the best-fitting semi-variogram model for each factor (Fig. 8).
Fig. 6 The spatial autocorrelation report based on Moran's I (index) for effective parameters
In the final stage, the distribution maps were combined based on the inducted rules from ODT, RT, and RF methods to generate final groundwater quality maps (Figs. 9, 10, and 11).
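The nugget-to-sill criterion can be written as a small function, shown here together with the spherical model as one example of the candidate semi-variograms. The parameter values are placeholders rather than the fitted values of Table 6, and it is assumed that the sill in the ratio is the total sill (nugget plus partial sill).

```python
def spherical_semivariogram(h, nugget, partial_sill, range_):
    """Spherical semi-variogram value at lag distance h."""
    if h <= 0:
        return 0.0
    if h >= range_:
        return nugget + partial_sill  # the total sill is reached at the range
    ratio = h / range_
    return nugget + partial_sill * (1.5 * ratio - 0.5 * ratio ** 3)


def spatial_dependence(nugget, sill):
    """Classify spatial dependence from the nugget-to-sill ratio in percent.

    Here sill is assumed to be the total sill (nugget + partial sill).
    """
    ratio = 100.0 * nugget / sill
    if ratio < 25.0:
        return "strong"
    if ratio <= 75.0:
        return "moderate"
    return "weak"


# Placeholder parameters, not taken from the study's Table 6.
nugget, partial_sill, range_ = 0.1, 0.9, 100.0
print(spherical_semivariogram(50.0, nugget, partial_sill, range_))
print(spatial_dependence(nugget, nugget + partial_sill))
```

A small nugget relative to the sill means that most of the variance is spatially structured, which is the situation in which kriging estimates benefit most from neighbouring observations.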
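The combination of thematic layers through rules can be sketched as a cell-by-cell classification of co-registered grids. The rules, thresholds, class names and tiny 2 x 2 grids below are hypothetical and stand in for the inducted rules and the kriged rasters; in a GIS environment the same logic would typically be expressed with raster algebra.

```python
def classify_cell(ec, hardness, chloride, ph):
    """Apply a hypothetical IF-THEN rule set to the values of one grid cell."""
    if ec <= 800.0:
        if hardness <= 300.0 and 6.5 <= ph <= 8.5:
            return "optimal"
        return "medium"
    if chloride > 400.0:
        return "poor"
    return "medium"


def combine_layers(ec_grid, hardness_grid, chloride_grid, ph_grid):
    """Classify every cell of four co-registered grids of equal shape."""
    return [
        [classify_cell(*cell) for cell in zip(*rows)]
        for rows in zip(ec_grid, hardness_grid, chloride_grid, ph_grid)
    ]


# Invented 2 x 2 rasters standing in for the interpolated parameter maps.
ec_grid = [[500.0, 900.0], [700.0, 2000.0]]
hardness_grid = [[200.0, 250.0], [400.0, 600.0]]
chloride_grid = [[100.0, 200.0], [150.0, 500.0]]
ph_grid = [[7.2, 7.5], [7.8, 8.0]]

quality_map = combine_layers(ec_grid, hardness_grid, chloride_grid, ph_grid)
for row in quality_map:
    print(row)
```

Because the rules are evaluated independently for each cell, maps produced from different rule sets can differ locally even when they share the same input layers.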
Groundwater quality maps generated using the rules inducted from ODT, RT and RF (Figs. 9, 10, and 11) show consistent overall groundwater quality patterns, although some inconsistencies are apparent in certain areas. According to the final groundwater quality maps, groundwater quality decreases from south to north and from east to west of the region, indicating quality gradients. All quality maps indicate the southern areas as the optimal choice for drinking water supply, whereas the northern, northwestern and western parts of the study region were of poor quality. The groundwater resources located to the south and southeast of Tabriz are recharged by the Sahand Mountains. Differences between the maps reflect differences among the rules inducted by the employed algorithms, which resulted in dissimilar classes. According to Fig. 9, the ODT method identified high quality water sources in the northeastern, eastern, southeastern, southern, southwestern and central parts of the study area. The RT method (Fig. 10), however, identified regions to the south and southeast as optimum quality water sources, while sectors in the northeast, southwest and central areas were labeled as medium quality and the northwest of the study area as poor quality. The performance and capability of the RT method in determining optimum and poor-quality zones was lower. Based on the RF map (Fig. 11), only the southern and southeastern parts of the study area were identified as optimum quality water sources.
Fig. 9 Water quality map using the inducted rules from ODT (optimal in blue and poor in red)
Based on the accuracy assessment tests and validation results, all three algorithms had high performance and dependable results; however, the map generated using the RF method was the most reliable and can be used as a basis for managerial decision-making procedures. This study highlighted the ability of a hybrid approach, incorporating data mining and GIS, to determine groundwater quality and locate high quality groundwater sources.
Data mining can convert experts' ideas into tangible IF-THEN rules through decision tree methods, and these rules can then be used by non-experts for water supply management.

Conclusion
This study sought to evaluate five common tree-based data mining algorithms (ODT, RT, RF, ID3 and CHAID) with respect to their performance and capabilities in rule induction for identifying high quality groundwater zones for drinking purposes. The main goals of the study were to induct rules for determining water quality and to use them to generate water quality maps for Tabriz City. To achieve these goals, the most relevant water quality parameters were extracted from among eight water quality variables and were then used as inputs to the different data mining algorithms in both continuous (numerical) and discrete (nominal) datasets. The algorithms' performances were assessed using several statistical criteria. The results indicate the superiority of the rules inducted from ODT, RT and RF in the numerical dataset and an under-performance of all five algorithms in the nominal dataset, including ID3 and CHAID, which are generally suitable for nominal datasets. The rules inducted from ODT, RT and RF had superior performance and accuracy, and were consequently used to generate water quality maps. Spatial distribution maps for the key parameters were generated using the OK method and combined based on the inducted rules. The final quality maps demonstrate the groundwater quality gradient, wherein water quality decreases from south to north and from east to west of the study region. Decision tree-based data mining algorithms and rule induction approaches showed a relatively high capability in classifying water quality based on a limited dataset. The RF method had the highest performance, and its quality map was the most reliable. Finally, it is recommended that the RF and ODT methods be used to induce rules for water quality determination in numerical datasets.
Funding Information
Open access funding provided by Lund University.
Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.