1 Introduction

Since water quality depends on many physico-chemical parameters, such as pH, total dissolved solids (TDS) and electrical conductivity (EC), its study often requires multivariate statistical methods such as multiple linear regression (MLR), factor analysis (FA), principal component analysis (PCA) and structural equation modeling (SEM). The availability of software for performing these methods has further increased their applicability. FA is a statistical method that exploits the correlation between observed variables to express the variance among them in terms of a potentially lower number of unobserved variables (Kim and Mueller 1978; Warne and Larsen 2014), thus reducing the dimension of the analysis. Water quality studies using FA include Reeder et al. (1972), Ashley and Lloyd (1978), Suk and Lee (1999) and Locsey and Cox (2003), among others. PCA is a related statistical method that transforms a possibly correlated set of variables into a smaller set of uncorrelated variables called principal components (PCs) (Dunteman 1989; Shlens 2003). Using the first few PCs, a large data set can be represented in the component space, thus reducing its dimension. The loadings of a PC indicate the contribution of the different variables to that component. Therefore, by analyzing the component loadings, information about the relations between the observed variables can be obtained, which can be used to improve a regression model, especially when the variables are strongly correlated. Studies that applied PCA include Mazlum et al. (1999), Petersen et al. (2001), Kotti et al. (2005), Chenini and Khemiri (2009), Amiri and Nakane (2009), Koklu et al. (2010), Eslamian et al. (2010), Bhardwaj et al. (2010) and Olsen et al. (2012), among others. Among these studies, Petersen et al. (2001), Chenini and Khemiri (2009), Amiri and Nakane (2009), Koklu et al. (2010) and Eslamian et al. (2010) combine the PCA and MLR methods.
SEM is an advanced multivariate statistical method that can be used to test as well as develop more than one MLR model related to a single problem. This is because the method can treat a variable as both dependent and independent, so that some variables which appear as independent when predicting a dependent variable can themselves be related through another MLR (Bentler 1988, 1990; Kline 2005; Byrne 2009; Iacobucci 2010). Chenini and Khemiri (2009) combined the PCA, MLR and SEM techniques for analyzing water quality data. Specifically, they applied PCA to reduce the dimension of the regression model by omitting the variables pH, potassium (K⁺) and temperature (T), whose loadings contributed poorly to the first and second PCs. They then developed an MLR model for predicting TDS with respect to the other variables and, finally, tested the MLR model using an SEM.

Kerala state faces challenges from rapid urbanization, which results in the depletion of agricultural areas and natural resources, including drinking water. Even though the state is blessed with heavy rainfall, pollution has led to a shortage of drinking water, especially in places near the cities (Kerala Vision 2030). This underscores the importance of analyzing and preserving ground water quality for the well-being of the state. There have been reports and analyses of the physico-chemical parameters of ground water at various places in Kerala (Chaudhary and Rachana Pillai 2009; Shaji et al. 2009; Joseph and Claramma 2010; Sujitha et al. 2012; Divya and Manonmani 2013; Subin and Miji 2013). However, to the best of our knowledge, no study of these characteristics using multivariate statistical methods has yet been conducted.

The objective of this study is to develop an MLR model for predicting TDS in terms of different physico-chemical parameters of ground water of the Kozhikode District, Kerala State, India. As in Chenini and Khemiri (2009), we first applied PCA to reduce the number of variables in the model. We then developed an MLR model to predict TDS. Finally, we applied SEM to further validate the developed MLR model, using the variety of fit indices associated with SEM (Schermelleh-Engel et al. 2003; Hooper et al. 2008), which we believe is the novelty of the current study.

2 Materials and Methods

The study covers an area of about 2344 km² located on the south-west coast of India (Fig. 1). It lies between 11°7′N and 11°49′N and between 75°32′E and 76°9′E. The district of Kozhikode has a 362.85 km² sandy coastal belt, a 1343.50 km² lateritic midland and a 637.65 km² rocky highland. The Laccadive Sea lies to the west of the city, and the Sahyadri Mountains rise approximately 60 km to the east. Kozhikode features a tropical monsoon climate. Like many other parts of Kerala, Kozhikode receives ample rain from the south-west monsoon from June to September and from the north-east monsoon from the second half of October through November (Kozhikode 2014).

Fig. 1 Study area and sampling locations

Ground water in the Kozhikode district occurs mainly in weathered and fractured crystalline rocks and also in laterite and alluvial deposits. Ground water occurs under phreatic conditions in the weathered zone. The depth to the water table varies from 2.00 to 16.05 m during the pre-monsoon period and from 0.55 to 11.40 m during the post-monsoon period (Joji 2009). Water is extracted by dug wells for domestic and irrigation purposes in this zone. Semi-confined and confined conditions exist in the deep fracture zone, where the depth to the water table varies between 10.6 and 169.2 m (Joji 2009). Water is extracted through bore wells. Phreatic aquifers exist in the lateritic midlands of Kozhikode, where the depth to the water table varies from 2.11 to 16.86 m during the pre-monsoon period and from 0.33 to 11.84 m during the post-monsoon period (Joji 2009). Water is extracted by dug wells in this zone. Both riverine and coastal alluvium are found in the district, where ground water occurs under phreatic conditions. The depth to the water table ranges from 2.00 to 6.63 m during the pre-monsoon period and from 0.99 to 4.03 m during the post-monsoon period (Joji 2009). Water is extracted by dug wells from this zone.

A total of 38 water samples were collected from wells in different parts of the study area (Fig. 1) in July 2014. Samples were collected in cleaned and well-dried, white, tightly capped, high-quality polyethylene bottles (2.5 L), taking the necessary precautions. The bottles were labeled with the collection point, date and time in order to avoid any error between collection and analysis. All the collected samples were immediately transported to the laboratory at low temperature in an ice box and stored there for the determination of both physical and chemical parameters. All the chemicals used were of AR grade. Double-distilled water was used for the preparation of all the reagents and solutions. Glassware was cleaned with commercial HCl followed by distilled water. All analyses were completed in the laboratory within one week.

The ground water samples were analyzed for pH, electrical conductivity (EC), total dissolved solids (TDS), bicarbonate (HCO₃⁻), chloride (Cl⁻), sulphate (SO₄²⁻), sodium (Na⁺), potassium (K⁺), calcium (Ca²⁺), magnesium (Mg²⁺), nitrate (NO₃⁻) and total hardness (TH), following the standard methods of the American Public Health Association (APHA 2012). EC, TDS, pH, chloride and nitrate were measured using CyberScan pH 6000. The major cations sodium, potassium, calcium and magnesium were analyzed with a flame photometer (Systronics 333), and TH was determined by EDTA titration. Sulphate was determined by gravimetric analysis. The results of the analysis are given in Table 1.

Table 1 Physico-chemical parameters of drinking water at studied watersheds for Kozhikode District, Kerala State, India

It can be verified that most of the samples in Table 1 do not satisfy a perfect ionic balance (a cation-to-anion ratio equal to 1). At this point, this should not be taken as evidence of inaccurate analysis, since the intention of the current study is to develop a regression model for predicting TDS as a function of the parameters analyzed. The fit of the model is later assessed using two different approaches, namely MLR and SEM.
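The ionic balance check mentioned above can be illustrated with a minimal sketch. It assumes that concentrations are given in mg/L (the units of Table 1 are not restated here) and converts them to meq/L using standard equivalent weights; the sample values and the function names `to_meq` and `ionic_balance` are hypothetical and not part of the study.

```python
# Equivalent weights in mg per meq (molar mass divided by ionic charge).
EQ_WEIGHT = {
    "Ca": 20.04, "Mg": 12.15, "Na": 22.99, "K": 39.10,       # cations
    "HCO3": 61.02, "Cl": 35.45, "SO4": 48.03, "NO3": 62.00,  # anions
}
CATIONS = ("Ca", "Mg", "Na", "K")
ANIONS = ("HCO3", "Cl", "SO4", "NO3")


def to_meq(sample_mg_per_l, ions):
    """Sum the given ions in meq/L; ions missing from the sample count as 0."""
    return sum(sample_mg_per_l.get(ion, 0.0) / EQ_WEIGHT[ion] for ion in ions)


def ionic_balance(sample_mg_per_l):
    """Return (cation/anion ratio, percent charge-balance error)."""
    cations = to_meq(sample_mg_per_l, CATIONS)
    anions = to_meq(sample_mg_per_l, ANIONS)
    ratio = cations / anions
    error_pct = 100.0 * (cations - anions) / (cations + anions)
    return ratio, error_pct


# Hypothetical sample (mg/L), NOT taken from Table 1.
sample = {"Ca": 20.04, "Mg": 12.15, "Na": 22.99, "K": 0.0,
          "HCO3": 61.02, "Cl": 35.45, "SO4": 48.03, "NO3": 0.0}
ratio, error_pct = ionic_balance(sample)
print(f"cation/anion ratio = {ratio:.3f}, balance error = {error_pct:.1f} %")
```

A ratio of 1 (balance error of 0 %) corresponds to the perfect ionic balance referred to in the text; only the ions listed in `EQ_WEIGHT` enter this simplified check.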

To analyze the data presented in Table 1, a PCA was conducted, which helps to visualize the data set when a large number of variables are involved. PCA achieves this by identifying those variables which have a similar impact on the system characteristics, thereby reducing the dimension of the problem by transferring the data to the principal component space. PCA was performed using MATLAB 2009a. After the PCA, an MLR was conducted, also using MATLAB 2009a, to obtain a regression model in terms of those variables which contribute the most to PC1 and PC2. Finally, the validity of the developed regression model was tested using SEM with IBM SPSS AMOS 22.0.

3 Results and Discussion

The electrical conductivity of ground water is directly related to the TDS and can, therefore, be used for an approximate estimation of the TDS (Wood 1976; Hem 1985). Specifically, a relation of the form TDS = k·EC can be used to estimate TDS from EC, where k lies between 0.55 and 0.75 (Hem 1985). From our sample data, we obtained the regression equation TDS = 0.4638·EC, with an R² value of 0.9926 and a root mean square error (RMSE) of 4.94. The fitted ratio of 0.4638 lies below this range; the comparatively low TDS/EC ratio can be attributed to the lower presence of ions (Ali 2010), which led us to a more detailed study of the chemical quality of the water, whose results are given in Table 1.
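A relation of the form TDS = k·EC can be fitted by least squares through the origin. The following sketch shows one way to do this; the readings are hypothetical (they are not the Table 1 data), the units in the comments are assumptions, and the conventions used for R² and RMSE (total sum of squares about the mean, division by n) are assumptions that may differ from those used in the study.

```python
import math


def fit_through_origin(ec, tds):
    """Least-squares slope k for the no-intercept model TDS = k * EC."""
    return sum(x * y for x, y in zip(ec, tds)) / sum(x * x for x in ec)


def rmse(ec, tds, k):
    """Root mean square error of the fitted line (divides by n, an assumption)."""
    n = len(ec)
    return math.sqrt(sum((y - k * x) ** 2 for x, y in zip(ec, tds)) / n)


def r_squared(ec, tds, k):
    """Coefficient of determination relative to the mean of the observed TDS."""
    mean_tds = sum(tds) / len(tds)
    ss_res = sum((y - k * x) ** 2 for x, y in zip(ec, tds))
    ss_tot = sum((y - mean_tds) ** 2 for y in tds)
    return 1.0 - ss_res / ss_tot


# Hypothetical EC (assumed µS/cm) and TDS (assumed mg/L) readings.
ec = [100.0, 200.0, 300.0, 400.0]
tds = [47.0, 92.0, 140.0, 185.0]
k = fit_through_origin(ec, tds)
print(f"k = {k:.4f}, R2 = {r_squared(ec, tds, k):.4f}, RMSE = {rmse(ec, tds, k):.2f}")
print("k within the 0.55-0.75 range of Hem (1985):", 0.55 <= k <= 0.75)
```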

The main aim of the study is to develop a regression model for predicting TDS involving parameters other than EC. For this, we started with a Principal Component Analysis (PCA) of the data in Table 1 excluding EC. PCA helps to reduce the dimension of the model keeping most of the information, and thereby, helps to find a possibly hidden simplified model. In other words, PCA helps to find those parameters which are most significant in an experiment which studies many parameters (Shlens 2003). In a recent study, Chenini and Khemiri (2009), in an effort to develop a regression model for predicting TDS from a similar data set, conducted a PCA and found that the parameters pH and K+ were not significant in this regression model.

Our PCA shows that the percentages of total variance explained by the first four principal components are 77.3009, 6.6557, 4.5201 and 2.8136, respectively, which together amount to around 91 % of the total variance. Table 2 gives the coefficients of the first four PCs. It follows from the table that TDS is the most significant and pH the least significant parameter for the first PC, which accounts for 77 % of the variance. This can be visualized more easily in Fig. 2. The blue vectors, which represent the coefficients of the first two PCs (see Table 2 for numerical values), show the predominance of TDS and the negligible contribution of pH. The red dots represent the data in Table 1 (excluding EC) in the PC space (with PC1 as the x-coordinate and PC2 as the y-coordinate). The concentration of the data around the x-axis shows the dominance of the first PC in the total variance. This leads to a regression model for predicting TDS using the parameters HCO₃⁻, SO₄²⁻, NO₃⁻, Cl⁻, Ca²⁺, Mg²⁺, K⁺ and Na⁺.

Table 2 Coefficients of the first four principal components
Fig. 2 Coefficients of the first two PCs (blue vectors) and representation of the Table 1 data in the PC space (red dots)
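To make the notions of PC coefficients and percentage of variance explained concrete, the following sketch computes the first principal component of a small data set. It is not the MATLAB procedure used in the study: it assumes PCA on the covariance matrix of the raw (unstandardized) variables, finds only the leading component by power iteration, and uses hypothetical data with three columns (TDS, Ca²⁺, pH).

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows` (list of equal-length lists)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_principal_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue of `cov` by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, eigenvalue


# Hypothetical rows [TDS, Ca, pH]; NOT the Table 1 data.
rows = [
    [120.0, 15.0, 6.9],
    [250.0, 32.0, 7.1],
    [90.0, 11.0, 6.8],
    [310.0, 41.0, 7.0],
    [180.0, 22.0, 7.2],
]
cov = covariance_matrix(rows)
loadings, eigenvalue = first_principal_component(cov)
# Share of total variance = leading eigenvalue / trace of the covariance matrix.
explained = 100.0 * eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print("PC1 coefficients:", [round(x, 4) for x in loadings])
print(f"variance explained by PC1: {explained:.2f} %")
```

In this toy data set the variable with the largest variance dominates PC1 and the nearly constant pH column contributes very little, which mirrors the qualitative pattern described for Fig. 2. Standardizing the variables before PCA would change the coefficients, so the choice between the two is a modeling decision.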

The results of the first regression analysis are given in Table 3, which shows that the R-square value is 0.9891 and that the p-value for the F-statistic is effectively zero. The p-values indicate that the regression coefficients of Ca²⁺, Mg²⁺, NO₃⁻, Na⁺ and Cl⁻ are statistically significant, and the magnitudes of the t-values suggest that these, in this order, are the most significant parameters. The p-values for the regression coefficients of K⁺, SO₄²⁻ and HCO₃⁻ are larger than 0.05, which suggests that these parameters are not statistically significant at the 5 % significance level. From the t-statistic and its p-value, it can be inferred that K⁺ is the least significant parameter. Hence, a second regression model was considered by excluding K⁺; the results are given in Table 4. Comparing Tables 3 and 4 shows that the order of the most significant parameters is unchanged (Ca²⁺, Mg²⁺, NO₃⁻, Na⁺ and Cl⁻, respectively), that the R-square is almost the same, and that the mean square error (MSE) decreases while the F-statistic increases. Also, the maximum p-value for a regression coefficient is now 0.1534. All of these suggest a better regression model. However, as in the first regression model, the parameters HCO₃⁻ and SO₄²⁻ are not significant. This led us to consider a third regression model that also excludes HCO₃⁻ and SO₄²⁻. It may be noted from Table 5 that the p-value for each regression coefficient is now less than 0.05; however, compared to the previous models, the R-square value decreases and the MSE increases. The order of significance of the parameters also changes, becoming Ca²⁺, Mg²⁺, Na⁺, NO₃⁻ and Cl⁻, respectively. Thus, among the parameters analyzed, TDS seems to depend mainly on Ca²⁺, Mg²⁺, NO₃⁻, Na⁺ and Cl⁻, with Ca²⁺ being the most significant parameter.

Table 3 Results of regression analysis with K+ included
Table 4 Results of regression analysis with K+ excluded
Table 5 Results of regression analysis with K⁺, HCO₃⁻ and SO₄²⁻ excluded
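The quantities compared across the three regression models (coefficients, t-statistics, R-square and MSE) can be obtained from the normal equations of ordinary least squares. The sketch below is a generic illustration under stated assumptions, not the MATLAB routine used in the study: the data are hypothetical, only two predictors are used, and p-values are omitted because they require the Student t distribution, which is not available in the Python standard library.

```python
import math


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(predictors, response):
    """Fit y = b0 + b1*x1 + ...; return coefficients, t-statistics, R-square, MSE."""
    X = [[1.0] + list(row) for row in predictors]      # prepend intercept column
    n, p = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(X, response)) for i in range(p)]
    xtx_inv = invert(xtx)
    beta = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in X]
    ss_res = sum((y - f) ** 2 for y, f in zip(response, fitted))
    mean_y = sum(response) / n
    ss_tot = sum((y - mean_y) ** 2 for y in response)
    mse = ss_res / (n - p)                              # residual mean square
    # Standard error of each coefficient: sqrt(MSE * diagonal of (X'X)^-1).
    t_stats = [b / math.sqrt(mse * xtx_inv[i][i]) for i, b in enumerate(beta)]
    return beta, t_stats, 1.0 - ss_res / ss_tot, mse


# Hypothetical data: columns are Ca and Mg; the response is TDS. NOT Table 1.
predictors = [[10.0, 5.0], [20.0, 3.0], [30.0, 8.0],
              [40.0, 6.0], [50.0, 9.0], [60.0, 4.0]]
tds = [61.0, 81.0, 134.0, 152.0, 197.0, 205.0]
beta, t_stats, r2, mse = ols(predictors, tds)
for name, b, t in zip(("const", "Ca", "Mg"), beta, t_stats):
    print(f"{name:>5}: coefficient = {b:8.4f}, t = {t:8.3f}")
print(f"R-square = {r2:.4f}, MSE = {mse:.4f}")
```

Ranking predictors by the magnitude of their t-statistics, and refitting after dropping the weakest one, reproduces the kind of stepwise comparison described in the text.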

The importance of calcium in water for the growth of fish has been reported by Wurts (1993). Hincks and Mackie (1997) and Prescott and Claudi (2012) studied the correlation between the presence of calcium and the presence of mussels in water. Laxmilatha et al. (2009) reported successful mussel farming in various places in Kozhikode, especially those near the sea. Hence, intuitively, calcium is a significant parameter of water in Kozhikode, and the regression model supports this intuition.

Even though the p-values for some coefficients are >0.05 in model 2, its comparatively high R-square and low MSE led us to accept it as our regression model for predicting TDS in the study area. We thus report the following regression equation for predicting TDS:

$$ \mathrm{TDS}=2.9314\cdot \mathrm{Ca}^{2+}+4.1791\cdot \mathrm{Mg}^{2+}+1.6788\cdot \mathrm{NO}_3^{-}+2.1566\cdot \mathrm{Na}^{+}+0.9224\cdot \mathrm{Cl}^{-}+0.4313\cdot \mathrm{HCO}_3^{-}+0.4262\cdot \mathrm{SO}_4^{2-}-4.3702 $$
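As a worked illustration, the regression equation above can be evaluated for a single sample. The coefficients are those of the reported equation; the sample concentrations are hypothetical and are assumed to be expressed in the same units as Table 1, and `predict_tds` is an illustrative name.

```python
# Coefficients of the reported regression equation for TDS (model 2).
COEFFICIENTS = {
    "Ca": 2.9314, "Mg": 4.1791, "NO3": 1.6788, "Na": 2.1566,
    "Cl": 0.9224, "HCO3": 0.4313, "SO4": 0.4262,
}
INTERCEPT = -4.3702


def predict_tds(sample):
    """Evaluate the regression equation; every ion in COEFFICIENTS must be present."""
    return INTERCEPT + sum(coef * sample[ion] for ion, coef in COEFFICIENTS.items())


# Hypothetical concentrations, assumed to be in the same units as Table 1.
sample = {"Ca": 10.0, "Mg": 5.0, "NO3": 2.0, "Na": 8.0,
          "Cl": 15.0, "HCO3": 30.0, "SO4": 4.0}
print(f"predicted TDS = {predict_tds(sample):.2f}")  # prints 94.93
```

Since the model was fitted to samples from the study area, predictions for concentrations far outside the sampled range should be treated with caution.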

Finally, the validity of the three regression models developed was checked using SEM. Each SEM contains the corresponding regression model and some other causal relationships between the independent variables (HCO₃⁻, SO₄²⁻, NO₃⁻, Cl⁻, Ca²⁺, Na⁺, Mg²⁺, K⁺). Many fit indices are widely used to evaluate how well an SEM fits a given data set (Schermelleh-Engel et al. 2003; Hooper et al. 2008); however, Barrett (2007) emphasizes the importance of reporting the results of the χ² test when judging an SEM fit. As in any sample study, the question of how large a sample must be to ensure reliable results also arises for SEM. Studies that address this question include Hoelter (1983), Bentler (1990), Bollen (1990), Kline (2005), Iacobucci (2010), Westland (2010) and Wolf et al. (2013), among others. Although Hoelter (1983) (i.e., Hoelter's critical N) and Westland (2010) suggest lower bounds for the sample size, it has not been proved that a model cannot be accepted if the sample size is below a certain value. This emphasizes the relevance of fit indices that reflect the model fit irrespective of the sample size. Marsh et al. (1988) suggested that the non-normed fit index (NNFI), also known as the Tucker-Lewis index (TLI), is relatively independent of the sample size. Here, to evaluate the SEM fit of the data, we report the following fit indices: (i) the χ² statistic, the degrees of freedom and the corresponding p-value; (ii) the Tucker-Lewis index (TLI); (iii) the root mean square error of approximation (RMSEA); (iv) the root mean square residual (RMR); (v) the goodness of fit statistic (GFI) and the adjusted goodness of fit statistic (AGFI); (vi) the parsimony goodness of fit index (PGFI); and (vii) the normed fit index (NFI) and the comparative fit index (CFI). The results of the SEM analysis are as follows.

  1. SEM 1

    Figure 3, which presents SEM 1, shows that the main model in SEM 1 is regression model 1, which predicts TDS in terms of HCO₃⁻, SO₄²⁻, NO₃⁻, Cl⁻, Ca²⁺, Na⁺, Mg²⁺ and K⁺. It also contains four sub-models for predicting HCO₃⁻, SO₄²⁻, NO₃⁻ and Cl⁻, respectively, using Ca²⁺, Na⁺, Mg²⁺ and K⁺. Table 6 gives the values of the fit indices for SEM 1. It can be seen that χ²/(degrees of freedom) is less than 2 and the p-value is >0.05, which indicates a good fit according to Schermelleh-Engel et al. (2003). The only fit index that does not indicate a good fit is the AGFI, which lies between 0.85 and 0.90 and therefore indicates an acceptable fit according to Schermelleh-Engel et al. (2003). During the study, it was found that including the covariance relations e2 ↔ e5 and e4 ↔ e5 leads to a good fit with a very small χ² value of 0.06, a p-value of 0.999, an AGFI of 0.996, and even better values for all other indices; however, suspecting that this could be the result of an over-parameterized model, we decided to retain the current model, which is a good fit to the data in Table 1 with respect to many fit indices and an acceptable fit with respect to the AGFI. Examining the regression coefficients of TDS in Fig. 3, one can see that all are almost the same as those obtained in MLR model 1 (Table 3), with the exception of the constant term. This further strengthens the validity of the regression model developed earlier.

    Fig. 3 SEM of regression model 1

    Table 6 Fit indices for SEM
  2. SEM 2

    This model is obtained from SEM 1 by excluding the direct relationship between TDS and K⁺. Figure 4 shows SEM 2, and, as in the case of SEM 1, the regression coefficients for TDS are almost the same as those of MLR model 2 in Table 4. Regarding the model fit, Table 6 shows a slight improvement in the χ² value, while the other fit indices are very close. The AGFI again indicates an acceptable rather than a good fit. From Table 6 and Fig. 4, it can be concluded that the SEM study agrees with the MLR method.

    Fig. 4 SEM of regression model 2 (with the K⁺ → TDS relation excluded)

  3. SEM 3

    Similar to SEM 2, this model is obtained from SEM 1 by excluding the direct relationships between TDS and each of K⁺, HCO₃⁻ and SO₄²⁻, exactly as in the case of MLR model 3. Figure 5 shows SEM 3. An examination of Fig. 5 and Table 5 reveals the same regression coefficients for predicting TDS in SEM 3 and MLR model 3. However, according to Table 6, the fit indices are not satisfactory, especially the AGFI, which does not even indicate an acceptable fit. This again justifies our earlier decision to accept MLR model 2 for predicting TDS based on the data analyzed.

    Fig. 5 SEM of regression model 3 (with the K⁺, HCO₃⁻ and SO₄²⁻ → TDS relations excluded)

Table 6 shows that for all the three SEMs, the fit index TLI is very close to 1, which indicates a very good fit. Since TLI is relatively independent of the sample size, we can accept the models developed even though our sample size is small.
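Several of the fit indices discussed above can be computed from the χ² statistics of the fitted model and of the baseline (independence) model. The sketch below uses the commonly cited textbook formulas for RMSEA, TLI, NFI and CFI; it is not the AMOS implementation, and all χ² values and degrees of freedom are hypothetical placeholders rather than values from Table 6. Only the sample size of 38 is taken from the study.

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation for a sample of size n."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index (non-normed fit index)."""
    base_ratio = chi2_base / df_base
    return (base_ratio - chi2_model / df_model) / (base_ratio - 1.0)


def nfi(chi2_model, chi2_base):
    """Normed fit index."""
    return (chi2_base - chi2_model) / chi2_base


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model, 0.0)
    return 1.0 - d_model / d_base if d_base > 0 else 1.0


# Hypothetical chi-square values and degrees of freedom; N = 38 samples.
n_samples = 38
chi2_m, df_m = 8.0, 6        # fitted model (placeholder values)
chi2_b, df_b = 400.0, 36     # baseline model (placeholder values)

print(f"chi2/df = {chi2_m / df_m:.3f} (values below 2 are read as a good fit in the text)")
print(f"RMSEA   = {rmsea(chi2_m, df_m, n_samples):.4f}")
print(f"TLI     = {tli(chi2_m, df_m, chi2_b, df_b):.4f}")
print(f"NFI     = {nfi(chi2_m, chi2_b):.4f}")
print(f"CFI     = {cfi(chi2_m, df_m, chi2_b, df_b):.4f}")
```

Note that the sample size enters the RMSEA explicitly, whereas the TLI depends only on ratios of χ² to degrees of freedom, which is one way to see why the TLI is regarded as relatively insensitive to sample size.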

4 Conclusion

A multivariate statistical study of the physico-chemical parameters of the ground water of the Kozhikode District, Kerala State, India was conducted. First, a linear regression model relating TDS to EC was developed, which revealed a comparatively low TDS/EC ratio. A PCA was then carried out to identify a possibly smaller number of variables that capture the essence of the entire data set. On this basis, a regression model for predicting TDS using the parameters HCO₃⁻, SO₄²⁻, NO₃⁻, Cl⁻, Ca²⁺, Mg²⁺, K⁺ and Na⁺ was formed. This model was then modified by excluding the least significant parameter, K⁺. A third regression model was then developed by also excluding the less significant HCO₃⁻ and SO₄²⁻; in this model, all the parameters were found to be significant. Owing to its comparatively high R-square and low MSE, the second model was selected for predicting TDS. Finally, the validity of the three regression models was tested using SEM, which yielded almost the same regression model for predicting TDS. Several fit indices indicated a very good SEM fit of the data despite the comparatively small sample size.