1 Introduction

Well water is generally used for agricultural activities or as drinking water. Its quality affects human activities and, consequently, the health of the population. Wells are the main source of water for the inhabitants of M’pody, a village in Anyama, a suburb north of Abidjan (Côte d’Ivoire). The water that these populations drink is not always treated, which can lead to illness. This was the case in M’pody, where an epidemic of diarrhoea was detected in January 2020. The epidemic affected sixty-nine (69) people, the majority of whom were children aged 0 to 5. It is therefore necessary to quantitatively assess the characteristics of the well water in order to find links between the quality of the water in these wells and this epidemic. The traditional method of assessing water quality involves analyzing physicochemical and microbiological parameters and comparing them with existing standards, in order to inform the public about the condition of these waters and the measures to be taken. With this in mind, Agbasi et al. [2] studied the contamination of sachet water by analyzing physicochemical parameters, heavy metals and microbial loads in sachet water from the six geopolitical zones of Nigeria during the period 2020–2023. The manufacture, delivery, storage and sale of sachet water, as well as poor environmental hygiene, were identified as potential sources of contamination. Abba et al. [1] used spatial, chemometric and indexical approaches to assess trace element pollution in the multi-aquifer groundwater system of the Al-Hassa oasis in Saudi Arabia. The average values revealed that chromium and iron concentrations exceeded the recommended limits for drinking water quality. The heavy metal assessment index, the heavy metal pollution index and the modified heavy metal index indicated low levels of groundwater pollution. Chemometric analysis identified human activities and geogenic factors as contributing to groundwater pollution. In a similar vein, Gobinder et al. [13] assessed the seasonal suitability of groundwater for irrigation using indexed approaches, statistical calculations, graphical plots and machine learning algorithms. They concluded that seasonal changes in groundwater quality for irrigation are influenced by monsoon dynamics, showing significant changes in cation and anion chemistry. The artificial neural network models were found to have superior predictive capabilities for irrigation suitability.

When a pollution event occurs, the water can be treated and reused for a variety of purposes. However, the intended reuse determines the level of treatment required. This is a difficult task, especially when the number of water points to be treated is large. It is therefore necessary to find techniques for grouping wells according to their physicochemical and microbiological characteristics, so that each group receives the treatment it requires. To this end, multivariate statistical analysis, such as principal component analysis (PCA), has been used to study the interactions between multiple factors ([5, 9, 10, 24, 25, 31, 32]). PCA serves as a theoretical basis for the other multidimensional statistical methods known as factorial methods, which appear as special cases. The quality of the estimates it produces depends on the number of principal components used to reconstruct the initial data. When more than two components are retained, the individuals must be examined on every factorial plane for a sound interpretation, which becomes tedious. Additionally, PCA is limited to linear correlations. Kernel PCA or hierarchical cluster analysis (HCA) are often used to overcome these problems. HCA has been the most widely used method for studying the physicochemical and microbiological characteristics of water ([22, 23, 36]).

Unlike these studies, which combined several multivariate statistical techniques, this study attempts to find links between the quality of well water and an epidemic that has claimed many lives. In other words, it attempts to determine how poor water quality contributed to an epidemic. The aim of this paper is to perform a detailed and comprehensive study of well water quality using conventional multivariate analysis techniques, in order to find relationships and conclusions that help determine the state of water quality from biological, physical and chemical indicators and so prevent future epidemics in other regions. Multivariate statistical analysis, including PCA, correspondence factorial analysis (CFA) and self-organizing map (SOM), is applied to a data set comprising three microbiological parameters (Escherichia coli (E. coli), Enterococcus faecalis (E. faecalis) and thermotolerant coliforms (CTH)) and thirteen physicochemical parameters (chlorides \((Cl^{-})\), conductivity (Cond), total hydrotimetric degree (DHT), bicarbonate (\(HCO_{3^{-}}\)), ammonium (\(NH_{4^{+}}\)), nitrate (\(NO_{3^{-}}\)), nitrite (\(NO_{2^{-}}\)), hydrogen potential (pH), phosphates (\(PO_{4}^{3^{-}}\)), sulphates (\(SO_{4}^{2^{-}}\)), temperature (T), total alkalinity contents (TAC) and turbidity (Tur)) sampled in seventy-two wells in 2020 over four campaigns (long dry season, long rainy season, short dry season and short rainy season). This paper can be used as a guide for future studies of water quality using multivariate statistics.

2 Materials and methods for classical data analysis

2.1 Materials

2.1.1 Description of the study area

M’Pody is a village in the commune of Anyama, in the autonomous district of Abidjan, Côte d’Ivoire (Fig. 1). Its geographical coordinates are \(5^{\circ }34'29''\) North latitude and \(4^{\circ }14'8''\) West longitude in DMS (degrees, minutes, seconds), or 5.57472 and \(-\)4.23556 in decimal degrees. The Universal Transverse Mercator (UTM) position is UM61 and the Joint Operation Graphics reference is NB30–10. Anyama covers an area of 114 \(km^{2}\) and its population is estimated at 325,209 inhabitants [29]. Natural vegetation has given way to intensive agriculture. The highly developed cultivation of oil palms and rubber trees has led to extensive degradation of the natural environment [12]. The climate is equatorial, with four seasons in the annual cycle: a long rainy season from April to July, a short dry season from August to September, a short rainy season from October to November and a long dry season from December to March. Average annual rainfall varies between 1600 and 2500 mm. Humidity is of the order of 80 to 90 percent. The study area is located in the onshore sedimentary basin to the north of the lagoon fault. The geological formations in the area are those of the Ivorian coastal sedimentary basin (coarse sands, variegated clays, iron-bearing sands and sandstones, etc.) [33]. The hydrography of the area consists of two small rivers, the Niéké and the Gbangbo, as well as several small non-permanent streams. The Niéké is a left bank tributary of the Agnéby river, which flows from north-east to south-west. The Gbangbo flows in a north–south direction and empties into the Ebrié lagoon. The geological context of the study area makes it possible to define a single type of hydrogeological unit containing groundwater: continuous aquifers. These aquifers are characteristic of the sedimentary basin. They are the Quaternary aquifer, the Mio-Pliocene aquifer (Continental Terminal) and the Upper Cretaceous (Maestrichtian) aquifer ([14, 19]).
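The correspondence between the DMS and decimal coordinates quoted above can be checked with a short calculation. The sketch below is only illustrative; the function name `dms_to_decimal` is a hypothetical helper, and the sign convention (negative for South and West) is the usual one.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and West longitudes are negative by convention.
    return -value if hemisphere in ("S", "W") else value

latitude = dms_to_decimal(5, 34, 29, "N")    # 5 deg 34' 29'' North
longitude = dms_to_decimal(4, 14, 8, "W")    # 4 deg 14' 8'' West
print(round(latitude, 5), round(longitude, 5))
```

Rounded to five decimals, the two values agree with the decimal coordinates given in the text.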

Fig. 1 Location map of the study area

2.1.2 Equipment and sampling

The main measuring equipment consists of a Palintest photometer (Great Britain), a pH meter, a conductivity meter and a turbidity meter for physicochemical parameters, and a membrane filtration device for bacteriological parameters. Water sampling was carried out from the seventy-two wells in the village during four campaigns (long dry season, long rainy season, short dry season, short rainy season) of the year 2020. Samples were taken in 1000 ml polyethylene containers for physicochemical parameters and 500 ml containers for microbiological parameters. The reagents used were of analytical quality. The reagents used to measure chemical parameters were PALINTEST brand (Great Britain). BIORAD Rapid E. coli 2 Agar, Bile Esculin Azide (BEA) agar and Tryptone Sulfite Neomycin (TSN) agar were used to enumerate markers of faecal contamination ([11, 12]).

2.2 Methods for classical data analysis

Samples were taken using strict aseptic techniques to prevent any accidental contamination. Each sample was collected in a sterile flask according to Jean Rodier’s recommendations [30]. Collected samples were stored in a cooler (4 \(^{\circ }\)C) and transported to the laboratory on the same day for analysis. Physicochemical parameters were determined using electrochemical and spectrophotometric methods. Microbiological analysis was carried out using the membrane filtration method (100 ml on a 0.45 \(\mu m\) membrane). The thirteen physicochemical parameters are chlorides \((Cl^{-})\), conductivity (Cond), total hydrotimetric degree (DHT), bicarbonate (\(HCO_{3^{-}}\)), ammonium (\(NH_{4^{+}}\)), nitrate (\(NO_{3^{-}}\)), nitrite (\(NO_{2^{-}}\)), hydrogen potential (pH), phosphates (\(PO_{4}^{3^{-}}\)), sulfates (\(SO_{4}^{2^{-}}\)), temperature (T), total alkalinity contents (TAC) and turbidity (Tur). The three microbiological parameters are Escherichia coli (E. coli), Enterococcus faecalis (E. faecalis) and thermotolerant coliforms (CTH). For more details on the analysis of these parameters, see [12]. Descriptive and multivariate analyses were performed on the two hundred and eighty-eight (288) samples. For each well, the analysis uses the average of the measurements of each physicochemical and microbiological parameter over its water samples. Means were computed using EXCEL 2010 software. PCA, CFA, analysis of variance (ANOVA), SOM and the location map of the study area were obtained using Python, R, GIMP and ArcGIS software.

2.2.1 Principal component analysis method

The context for PCA involves a data set with observations on p numerical variables for each of n individuals. These data values define an \(n\times p\) data matrix \(Y=(Y_{j})_{(1\le j\le p)}\). The observation of the variable \(Y_{j}\) on individual i (\(1\le i\le n\)) is \(Y_{ij}\). In most cases, the variables studied do not have the same unit of measurement. It is therefore common practice to begin by standardizing the variables as in (1)

$$\begin{aligned} Z_{k}=(Z_{ik})_{(1\le i\le n)}=\left( \frac{Y_{ik}-\overline{Y}_{k}}{S_{k}}\right) _{(1\le i\le n)} \end{aligned}$$
(1)

where \(\overline{Y}_{k}\) and \(S_{k}\) are respectively the mean and the standard deviation of the variable \(Y_{k}\). The principle of PCA is to reduce the dimension of the initial data by replacing the initial p variables with q new uncorrelated variables (\(q < p\)). These new variables are called the principal components of the data set and are denoted \((F_{k})_{(1\le k\le q)}\), as in (2)

$$\begin{aligned} \left\{ \begin{array}{rcl} F_{1} &{}=&{} a_{11}Z_{1}+a_{21}Z_{2}+\cdots +a_{p1}Z_{p} \\ &{}\cdots &{} \\ F_{k} &{}=&{} a_{1k}Z_{1}+a_{2k}Z_{2}+\cdots +a_{pk}Z_{p} \\ &{}\cdots &{} \\ F_{q} &{}=&{} a_{1q}Z_{1}+a_{2q}Z_{2}+\cdots +a_{pq}Z_{p} \end{array} \right. \end{aligned}$$
(2)

Principal components are linear combinations of the initial p variables that successively maximize variance. The total variance captured by all p principal components is equal to the total variance in the original data set. The first principal component captures the most variation in the data, the second captures the maximum remaining variance in a direction orthogonal to the first, and so on. Before analyzing the results of a PCA, the correlation matrix between the initial variables must be studied. This gives an initial idea of the correlation structure between these variables. The eigenvalues of this correlation matrix are then used to build a table of the percentages of variance explained, together with the associated cumulative percentages. This table is used to select the q dimensions retained to interpret the PCA. The linear correlation coefficients between each initial variable and each selected factor are then calculated.
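The steps described above (standardization, correlation matrix, eigenvalues, percentages of variance explained) can be sketched as follows. This is a minimal illustration, assuming NumPy is available and uniform weights; the toy matrix `Y` and the function name `pca_correlation` are hypothetical and are not the study's data or code.

```python
import numpy as np

def pca_correlation(Y):
    """PCA of an (n x p) data matrix through its correlation matrix."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    # Standardize each column as in (1); np.std uses the population formula (divisor n).
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    R = Z.T @ Z / n                          # correlation matrix of the p variables
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]    # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    scores = Z @ eigenvectors                # principal components F_1, ..., F_p
    explained = 100 * eigenvalues / eigenvalues.sum()
    return eigenvalues, eigenvectors, scores, explained

# Hypothetical toy data: 6 individuals, 3 variables (not the study's measurements).
Y = [[2.0, 1.0, 5.0],
     [3.0, 4.0, 2.0],
     [5.0, 3.0, 6.0],
     [7.0, 8.0, 1.0],
     [8.0, 6.0, 4.0],
     [9.0, 9.0, 3.0]]

eigenvalues, eigenvectors, scores, explained = pca_correlation(Y)
print("eigenvalues:", np.round(eigenvalues, 3))
print("explained variance (%):", np.round(explained, 2))
print("cumulative (%):", np.round(np.cumsum(explained), 2))
```

Because the variables are standardized, the eigenvalues sum to p, and the components are uncorrelated with variances equal to the eigenvalues.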

Let \(p_{i}\) be the weight of individual i and \(c_{i}^{k}\) the coordinate of individual i on the k-th principal component \(F_{k}\). The correlation between the variable \(Y_{j}\) and \(F_{k}\) is given by formula (3), where \(Z_{ij}\) is the standardized value defined in (1)

$$\begin{aligned} cor(Y_{j}, F_{k})=\sum _{i=1}^{n}p_{i}Z_{ij}\frac{c_{i}^{k}}{\sigma (F_{k})} \end{aligned}$$
(3)

where \(\sigma ^{2}(F_{k})=\sum _{i=1}^{n}p_{i}(c_{i}^{k})^{2}=\lambda _{k}\) is the variance of \(F_{k}\), equal to the k-th eigenvalue. This correlation is used to construct graphs of the variables. Studying these graphs helps to interpret the principal components. Another tool for interpreting principal components is the notion of contribution. The contribution of the variable \(Y_{j}\) to the variance of the \(F_{k}\) axis is defined by formula (4)

$$\begin{aligned} ctr(Y_{j}, F_{k})=\frac{cor(Y_{j}, F_{k})^{2}}{\sum _{l=1}^{p}cor(Y_{l}, F_{k})^{2}}. \end{aligned}$$
(4)

The contribution is also defined for individual \(X_{i}\). The contribution of individual \(X_{i}\) to the dispersion of the \(F_{k}\) axis is defined by (5)

$$\begin{aligned} ctr(X_{i}, F_{k})=\frac{p_{i}(c_{i}^{k})^{2}}{\lambda _{k}}. \end{aligned}$$
(5)
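The correlations and contributions in (3), (4) and (5) can be illustrated numerically. The sketch below assumes NumPy, uniform weights \(p_{i}=1/n\) and variables standardized as in (1); the toy data and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 6 individuals, 3 variables (not the study's measurements).
Y = np.array([[2.0, 1.0, 5.0],
              [3.0, 4.0, 2.0],
              [5.0, 3.0, 6.0],
              [7.0, 8.0, 1.0],
              [8.0, 6.0, 4.0],
              [9.0, 9.0, 3.0]])
n, p = Y.shape
weights = np.full(n, 1.0 / n)               # assumed uniform weights p_i = 1/n

Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)    # standardized variables, as in (1)
R = (Z * weights[:, None]).T @ Z            # weighted correlation matrix
lam, A = np.linalg.eigh(R)
order = np.argsort(lam)[::-1]
lam, A = lam[order], A[:, order]            # eigenvalues lambda_k, coefficients a_jk
C = Z @ A                                   # coordinates c_i^k of the individuals

# Correlation between variable j (rows) and component k (columns), as in (3).
cor = (Z * weights[:, None]).T @ C / np.sqrt(lam)
# Contribution of each variable to each axis, as in (4).
ctr_var = cor ** 2 / (cor ** 2).sum(axis=0)
# Contribution of each individual to each axis, as in (5).
ctr_ind = weights[:, None] * C ** 2 / lam

print(np.round(cor, 3))
print(np.round(ctr_var, 3))
print(np.round(ctr_ind, 3))
```

For each axis, the contributions of the variables sum to one, as do the contributions of the individuals, which is what makes them readable as shares of the axis variance.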

2.2.2 Correspondence factorial analysis method

Correspondence analysis is a factorial method of multidimensional descriptive statistics. Its aim is to analyze the relationship between two qualitative variables. The row profiles and the column profiles of the contingency table are analyzed separately, and the graphical results of these two analyses are then superimposed to produce one or more scatter plots. The resulting graph combines the modalities of the two variables under study, which makes it possible to study the relationship between them. In this system, proximity between observations or between variables is interpreted as strong similarity, and proximity between observations and variables is interpreted as a strong relationship. The relationship between two qualitative variables \(X=(X_{i})_{(i\in I)}\) and \(Y=(Y_{j})_{(j\in J)}\) is studied on N individuals. The cardinal of I is denoted n and that of J is denoted p. The number of individuals having the modality i of X and the modality j of Y is denoted \(x_{ij}\). The contingency table is given by the matrix \((x_{ij})_{(1\le i\le n;1\le j\le p)}\) or \((f_{ij})_{(1\le i\le n;1\le j\le p)}\) with \(f_{ij}=\frac{x_{ij}}{N}.\) The column profiles form a cloud of p points in the space \(\mathbb {R}^{n}\); the j-th column profile has entries \(\frac{f_{ij}}{f_{.j}} =P(X=i|Y=j)\), where \(f_{.j}=\sum _{i=1}^{n}f_{ij} =P(Y=j)\), for \(j=1,\ldots ,p\). The associated marginal column profile is \(G_{C}=(f_{1.},\ldots ,f_{n.})\), with \(f_{i.}=\sum _{j=1}^{p}f_{ij}=P(X=i)\).

The \(\chi ^{2}\) distance between two column profiles j and \(j'\) is

$$\begin{aligned} d_{\chi ^{2}}^{2}(j,j')=\sum _{i=1}^{n}\frac{1}{f_{i.}}\left( \frac{f_{ij}}{f_{.j}}-\frac{f_{ij'}}{f_{.j'}}\right) ^{2}. \end{aligned}$$
(6)

The \(\chi ^{2}\) distance between the column profile j and the marginal profile \(G_{C}\) is defined as follows

$$\begin{aligned} d_{\chi ^{2}}^{2}(j,G_{C})=\sum _{i=1}^{n}\frac{1}{f_{i.}}\left( \frac{f_{ij}}{f_{.j}}-f_{i.}\right) ^{2}. \end{aligned}$$
(7)

The total inertia of the cloud of column profiles with respect to \(G_{C}\) is

$$\begin{aligned} I_{G_{C}}= & {} \sum _{j=1}^{p}f_{.j}d_{\chi ^{2}}^{2}(j,G_{C})=\sum _{i=1}^{n}\sum _{j=1}^{p}\frac{\left( f_{ij}-f_{i.}f_{.j}\right) ^{2}}{f_{i.}f_{.j}}\nonumber \\= & {} \frac{\chi ^{2}}{N}=\phi ^{2}. \end{aligned}$$
(8)

This total inertia is decomposed along a sequence of axes of decreasing importance, each representing a synthetic aspect of the relationship between the two variables. The rows and columns are then represented so that the position of a point reflects its contribution to the departure from independence. The \(\chi ^{2}\) distance between two row profiles, the distance between a row profile and its marginal profile, and the total inertia of the cloud of row profiles are defined analogously to (6), (7) and (8), respectively.
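The link between the total inertia and Pearson's chi-square statistic can be checked on a small contingency table. The sketch below is illustrative only: the table is hypothetical, the function names are assumptions, and the inertia is computed as the weighted sum of \(\chi ^{2}\) distances between the column profiles and the mean profile, then compared with the chi-square statistic divided by the total count N of individuals.

```python
def total_inertia(table):
    """Total inertia of the column profiles about the mean profile G_C, as in (8)."""
    N = sum(sum(row) for row in table)
    f = [[x / N for x in row] for row in table]      # relative frequencies f_ij
    f_row = [sum(row) for row in f]                  # margins f_i.
    f_col = [sum(col) for col in zip(*f)]            # margins f_.j
    inertia = 0.0
    for j, fj in enumerate(f_col):
        # Squared chi-square distance between column profile j and G_C, as in (7).
        d2 = sum((f[i][j] / fj - fi) ** 2 / fi for i, fi in enumerate(f_row))
        inertia += fj * d2
    return inertia

def chi_square(table):
    """Pearson chi-square statistic of independence for a contingency table."""
    N = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    return sum((x - r * c / N) ** 2 / (r * c / N)
               for row, r in zip(table, rows)
               for x, c in zip(row, cols))

# Hypothetical 3 x 3 contingency table (counts of individuals).
table = [[20, 10, 5],
         [10, 25, 10],
         [5, 10, 30]]
N = sum(sum(row) for row in table)
inertia = total_inertia(table)
phi2 = chi_square(table) / N
print(inertia, phi2)
```

Both quantities coincide up to rounding error, which is why the axes of a correspondence analysis can be read as a decomposition of the departure from independence.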

2.2.3 ANOVA method

One-way analysis of variance is used to study the effect of a qualitative variable X, called a factor, on a continuous quantitative variable Y. It shows whether the mean of the quantitative variable is the same in the different groups [4]. The different values taken by the factor X are called levels (or populations). For factor X, it is assumed that there are k levels and k samples of respective sizes \(n_{1},\ldots ,n_{k}.\) The total number of observations is \(n=\sum _{i=1}^{k}n_{i}.\) The value of the variable \(Y=(Y_{ij})_{1\le i \le k; 1\le j\le n_{i}}\) is measured at each experiment. The analysis of variance model is then written as in (9)

$$\begin{aligned} \left\{ \begin{array}{rcl} Y_{ij} &{}=&{} m_{i}+\varepsilon _{ij},\;1\le i\le k\;;\;1\le j\le n_{i}\\ &{}=&{} \mu +\alpha _{i}+\varepsilon _{ij} \end{array} \right. \end{aligned}$$
(9)

with

  • \(\varepsilon _{ij}\sim N(0,\sigma ^{2})\),

  • \(\mu\) average effect,

  • \(\alpha _{i}\) effect of level i of factor X,

  • \(Y_{ij}\) observation of index j of level i of the factor X.

The constraints are \(\sum _{i=1}^{k}n_{i}\alpha _{i}=0\) and, for all \((i,j)\ne (i',j')\), \(\varepsilon _{ij}\) and \(\varepsilon _{i'j'}\) are independent. The null and alternative hypotheses of the one-factor ANOVA are then given by (10) or, equivalently, (11)

$$\begin{aligned} \left\{ \begin{array}{rcl} H_{0} &{}:&{} m_{1}=m_{2}=\ldots =m_{k}\\ H_{1} &{}:&{} \exists \, i,j\in \{1, \ldots , k\}\; \text {such that}\; m_{i}\ne m_{j} \end{array} \right. \end{aligned}$$
(10)

or,

$$\begin{aligned} \left\{ \begin{array}{rcl} H_{0} &{}:&{} \alpha _{1}=\alpha _{2}=\ldots =\alpha _{k}=0\\ H_{1} &{}:&{} \exists \,i\in \{1, \ldots , k\}\; \text {such that}\; \alpha _{i}\ne 0. \end{array} \right. \end{aligned}$$
(11)

The statistical test defined in (12) is used to determine the significance of the factorial variance in relation to the residual variance. This is the ratio test of these two variances, the formula for which is as follows:

$$\begin{aligned} F=F_{(k-1,n-k)}=\frac{SCF/(k-1)}{SCR/(n-k)}. \end{aligned}$$
(12)

The quantities used in this ratio are defined as follows: \(SCF=\sum _{i=1}^{k}\sum _{j=1}^{n_{i}}(\overline{Y_{i}}-\overline{Y})^{2}\) is the dispersion due to the factor and \(SCR=\sum _{i=1}^{k}\sum _{j=1}^{n_{i}}(Y_{ij}-\overline{Y_{i}})^{2}\) is the residual dispersion; \(\overline{Y}=\frac{1}{n}\sum _{i=1}^{k}\sum _{j=1}^{n_{i}}Y_{ij}\) is the overall mean of the observations and \(\overline{Y_{i}}=\frac{1}{n_{i}}\sum _{j=1}^{n_{i}}Y_{ij}\) is the mean of level i of factor X. Under the assumptions of normality and homogeneity of variance of the residuals (differences between the observations and the group means), the F statistic follows, under \(H_{0}\), a Fisher distribution with \(k-1\) and \(n-k\) degrees of freedom. If the value of F is greater than the theoretical threshold value given by the Fisher distribution for a given alpha risk (usually 5 percent), then the test is significant. In this case, the factorial variability is significantly higher than the residual variability, and we conclude that the means are not all equal. If these assumptions are not satisfied, it is always possible to apply a transformation to the responses (log for example), to use a non-parametric ANOVA (Kruskal-Wallis test), or to carry out an ANOVA based on permutation tests.
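The computation of SCF, SCR and the F ratio in (12) can be followed on a small example. The sketch below uses only the Python standard library; the three groups are hypothetical values chosen so that the arithmetic is easy to verify by hand, and the function name `one_way_anova` is an assumption.

```python
def one_way_anova(groups):
    """Return (SCF, SCR, F) for a list of samples, one per level of the factor."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Dispersion due to the factor: each group mean counted once per observation.
    scf = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Residual dispersion: observations around their own group mean.
    scr = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    f_stat = (scf / (k - 1)) / (scr / (n - k))
    return scf, scr, f_stat

# Hypothetical data: k = 3 levels with 3 observations each.
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
scf, scr, f_stat = one_way_anova(groups)
print(scf, scr, f_stat)
```

Here the group means are 5, 7 and 9 and the overall mean is 7, so SCF = 24, SCR = 6 and F = (24/2)/(6/6) = 12. The tabulated 5 percent threshold of the Fisher distribution with (2, 6) degrees of freedom is about 5.14, so the test would be significant for these illustrative values.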

2.2.4 Self-organizing map method

SOM is a method of classification, representation and analysis of relationships. It was introduced by Teuvo Kohonen in the 1980s, based on neuromimetic motivations ([17, 18]). In practice, a Kohonen network is made up of N units arranged according to a certain topology. For each unit i in the network, a neighborhood of radius r, denoted \(V_{r}(i)\), is defined. This neighborhood consists of all the units located at a distance less than or equal to r from unit i. Each unit i is represented in the space \(\mathbb {R}^{p}\) by a vector \(C_{i}\) called the weight vector or code vector. The state of the network at time t is given by \(C(t) = (C_{1}(t),C_{2}(t),\ldots ,C_{N}(t)).\) For a given state C and a given observation x, the winning class \(i_{0}(C,x)\) is the one whose code vector \(C_{(i_{0} (C,x))}\) is closest to the observation x in the sense of a certain distance. The winning class \(i_{0}(C,x)\) is defined in (13)

$$\begin{aligned} i_{0}(C,x)=arg\min _{i}\Vert x-C_{i}\Vert . \end{aligned}$$
(13)

For a given state C, the network defines a mapping \(\psi _{C}\) which associates with each observation x the index of its class. After convergence of the Kohonen algorithm, \(\psi _{C}\) respects the topology of the input space, in the sense that neighboring observations in the space \(\mathbb {R}^{p}\) are associated with neighboring units or with the same unit. The code vector construction algorithm is defined iteratively in (14) as follows:

  • At time 0, the N code vectors are randomly initialized,

  • At time t, the state of the network is C(t) and an observation \(x(t+1)\) is presented according to a probability distribution P,

$$\begin{aligned} \left\{ \begin{array}{rcl} i_{0}(C(t),x(t+1)) &{}=&{} arg\min \left\{ \Vert x(t+1)-C_{i}(t)\Vert ,\;1\le i\le N\right\} \\ C_{i}(t+1) &{}=&{} C_{i}(t)-\varepsilon (t)\left( C_{i}(t)-x(t+1)\right) ,\forall i\in V_{r(t)}(i_{0}) \\ C_{i}(t+1) &{}=&{} C_{i}(t), \;\forall i \;\text {not in}\; V_{r(t)}(i_{0}) \end{array} \right. \end{aligned}$$
(14)

where \(0\le \varepsilon (t)\le 1\) is the adaptation parameter and r(t) the radius of the neighborhoods at time t. After convergence of the algorithm, the n observations are classified into K classes according to the nearest neighbor method, relative to the distance chosen in \(\mathbb {R}^{p}.\) Graphical representations can then be constructed based on the network topology. For further details, please refer to [6] and [7].
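The iteration in (13) and (14) can be sketched in a few lines. The example below is a simplified illustration, not the implementation used in the study: it assumes a one-dimensional chain of units, a Euclidean distance, a uniform sampling distribution P, and linearly decreasing schedules for \(\varepsilon (t)\) and r(t). The data and all function names are hypothetical.

```python
import random

def winning_unit(codes, x):
    """Index i0 of the code vector closest to x, as in (13)."""
    return min(range(len(codes)),
               key=lambda i: sum((xj - cj) ** 2 for xj, cj in zip(x, codes[i])))

def som_step(codes, x, eps, radius):
    """One update of (14) on a one-dimensional chain of units."""
    i0 = winning_unit(codes, x)
    for i in range(len(codes)):
        if abs(i - i0) <= radius:            # unit i belongs to V_r(i0)
            codes[i] = [c - eps * (c - xj) for c, xj in zip(codes[i], x)]
    return i0

def train_som(data, n_units, n_steps, seed=0):
    rng = random.Random(seed)
    # Time 0: random initialization (here, by drawing observations at random).
    codes = [list(rng.choice(data)) for _ in range(n_units)]
    for t in range(n_steps):
        x = rng.choice(data)                   # x(t+1), drawn uniformly (assumed P)
        eps = 0.5 * (1 - t / n_steps)          # decreasing adaptation parameter
        radius = 1 if t < n_steps // 2 else 0  # shrinking neighborhood radius
        som_step(codes, x, eps, radius)
    return codes

# Hypothetical observations in R^2 forming two loose groups.
data = [[0.1, 0.2], [0.0, 0.4], [0.3, 0.1], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
codes = train_som(data, n_units=3, n_steps=200)
labels = [winning_unit(codes, x) for x in data]   # nearest code vector per observation
print(labels)
```

Each update moves the winning code vector, and its neighbors on the chain, a fraction \(\varepsilon (t)\) of the way towards the presented observation. The final labels assign every observation to its nearest code vector.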

3 Results and discussion

3.1 Descriptive statistics

The mean, maximum (max), minimum (min), median (med) and standard deviation (sd) were used to describe the data for the sixteen (16) parameters studied across the two hundred and eighty-eight (288) samples. The means of all the parameters are compared with the World Health Organization (WHO) [35] standards in Table 1. Temperature influences the rate of chemical and biological reactions and affects the level of dissolved oxygen in the water. In the present study, water temperature varied from \(25.88\pm 0.83 \;^{\circ }C\) to \(29.08\pm 0.83 \;^{\circ }C\) with a mean of \(27.79\pm 0.83 \;^{\circ }C\). The pH measures the acidity or basicity of a solution. It varied from \(4.22\pm 0.88\) to \(11.515\pm 0.88\) with a mean of \(5.21\pm 0.88\), which means that the water from the wells is, on average, acidic. In all these wells, the pH is outside the WHO permitted range [6.5, 8.5]. The characteristics of the M’pody soil (coarse sands, ferruginous sands and sandstones, etc.) could explain the acidity of these waters. In line with WHO standards, these well waters should not be consumed without being treated. Electrical conductivity is the ability of an aqueous solution to conduct electric current. It reflects the total amount of dissolved minerals in a solution. It varied from \(24.73\pm 121.12\, \mu S/cm\) to \(594\pm 121.12 \,\mu S/cm\) with a mean of \(157.776\pm 121.12 \,\mu S/cm\). This means that the well water is generally poorly mineralized. Turbidity varied from \(2.71\pm 23.55\) to \(162.43\pm 23.55\) NTU with a mean of \(22.41\pm 23.55\) NTU. Turbidity levels in the well water are, on average, higher than the WHO standard. In well water, turbidity is caused by small suspended particles of various kinds, such as clays and silts, microsands, bacteria, organic matter and mineral salts. Most of the time, they result from leaching of the surrounding soil and therefore indicate a well that is poorly protected from run-off water. In addition, the means of \(NO_{3^{-}}\), \(NO_{2^{-}}\), \(NH_{4^{+}}\), \(PO_{4}^{3^{-}}\), \(Cl^{-}\), TAC, DHT, \(SO_{4}^{2^{-}}\) and \(HCO_{3^{-}}\) comply with WHO standards. Microbiological analysis of the well water showed the presence of germs. These microorganisms reached maxima of 7400 CFU/250 ml for thermotolerant coliforms, 5650 CFU/250 ml for E. coli and 3913 CFU/250 ml for E. faecalis. This faecal pollution of the water could be explained, on the one hand, by infiltration from septic tanks located near the wells and, on the other hand, by the run-off of waste water carrying human and animal faecal matter. These results are consistent with those of [12] and [15].

Table 1 Average concentrations of physicochemical and microbiological parameters in well water

3.2 Results of principal component analysis and correspondence factor analysis

PCA is used to extract information from a table of quantitative data of the type individuals\(\times\)variables, in order to study the proximity between individuals (wells) on the one hand and the links between variables (parameters) on the other. Measuring the proximity between wells means determining which wells are similar in terms of physicochemical and microbiological parameters, in order to form groups of wells. Intuitively, two wells are close if their coordinates in \(\mathbb {R}^{p}\), the space of parameters, are close, in other words, if the observations made on the p parameters are close. Quantifying this proximity requires equipping the space \(\mathbb {R}^{p}\) with a measure of distance between the wells. PCA also provides graphical representations of distances between individuals and correlations between variables. It is, in addition, a method of dimension reduction: it constructs a small number of synthetic variables (axes) that summarize the initial variables as well as possible. In this study, the eigenvalue extraction method was applied to the correlation matrix (Fig. 2) to determine the principal components. The results are presented in Fig. 3 and Table 2. Combining the Kaiser criterion, the scree plot and the proportion of variance explained, the number of factors to retain is five. Thus, only the first five principal components were retained in the analysis and the other components were omitted. It is very important to study the correlations between the new synthetic dimensions and the original variables. These correlation coefficients are then used to estimate the relative contributions (ctr) of each original variable (Table 3) to the construction of the principal components. All the wells are then projected onto the different planes defined by these principal components. An extract from these projections is shown in Fig. 4.

Fig. 2 Correlation matrix between the parameters

Fig. 3 Percentage of variance explained by the principal components

Table 2 Eigenvalues and percentage of variances on each principal component in PCA

Fig. 4 Projection of wells on the factorial plane (1, 2) in PCA

Table 3 Correlations and contributions of variables on the different principal components

Fig. 5 Projection of variables on the factorial planes (1, 2) and (1, 3) in PCA

The factor loading classification method adopted by Liu et al. [21] is used to study the correlations between the variables and the retained principal components. In this classification, the loading r is considered strong if \(|r| \ge 0.75,\) moderate if \(0.5 \le |r| <0.75\) and weak if \(0<|r| < 0.5.\) Fig. 2 shows that, in general, the values of the correlation coefficients reflect natural physical, chemical and microbiological behavior. Further evaluation of these coefficients shows that the strongest correlations are observed between TAC and DHT (0.891), TAC and \(HCO_{3^{-}}\) (0.875), and E. coli and CTH (0.96). Moderate correlations are observed between E. coli and E. faecalis (0.655), CTH and E. faecalis (0.665), Tur and DHT (0.632), \(Cl^{-}\) and Cond (0.525), TAC and Tur (0.515), \(NO_{3^{-}}\) and Cond (0.687), TAC and \(SO_{4}^{2^{-}}\) (0.672), \(SO_{4}^{2^{-}}\) and \(HCO_{3^{-}}\) (0.612), and \(HCO_{3^{-}}\) and DHT (0.719). The other correlations are weak.
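The three-level classification above is easy to express as a small function. The sketch below applies the thresholds quoted in the text to a few of the coefficients reported in this paragraph; the function name `loading_strength` is hypothetical, and treating r = 0 as weak is an assumption, since the quoted rule only covers \(0<|r|\).

```python
def loading_strength(r):
    """Classify a coefficient r as strong, moderate or weak from |r|."""
    magnitude = abs(r)
    if magnitude >= 0.75:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    return "weak"   # assumption: r = 0 is also reported as weak

# Coefficients quoted in the text above.
reported = {
    ("TAC", "DHT"): 0.891,
    ("E. coli", "CTH"): 0.96,
    ("E. coli", "E. faecalis"): 0.655,
    ("Cl-", "Cond"): 0.525,
}
classes = {pair: loading_strength(r) for pair, r in reported.items()}
for pair, label in classes.items():
    print(pair, label)
```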

The PCA extracted five principal components (Fig. 3), which together accounted for 75.079 percent of the total variance, with eigenvalues ranging from 1.131 to 4.412 (Table 2). PC1 accounted for 27.573 percent, PC2 for 20.797 percent, PC3 for 12.015 percent, PC4 for 7.626 percent and PC5 for 7.068 percent. The parameters defining PC1 are TAC, DHT, \(SO_{4}^{2^{-}}\), \(HCO_{3^{-}}\), \(NO_{2^{-}}\) and Tur; those defining PC2 are CTH, E. coli, E. faecalis, Cond, T, \(Cl^{-}\) and \(NO_{3^{-}}\); those defining PC3 are CTH, E. coli, Tur, Cond, pH and \(NO_{3^{-}}\); those defining PC4 are \(NO_{2^{-}}\), \(NH_{4^{+}}\), \(PO_{4}^{3^{-}}\), \(Cl^{-}\) and \(SO_{4}^{2^{-}}\); and PC5 is defined by pH, T, \(NO_{2^{-}}\) and \(NH_{4^{+}}\). PC1 can be interpreted as a component of metal cations (calcium, magnesium), hydroxides, bicarbonates, carbonates and turbid water. This interpretation of PC1 is consistent with the findings of [37] and [38]. The second and third components indicate microbial components. The fifth component indicates turbid and acidic water.
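The percentages and eigenvalues quoted above can be cross-checked. Since the PCA is run on the correlation matrix of 16 standardized parameters, the eigenvalues sum to 16, so each eigenvalue equals 16 times its share of the variance. The sketch below is a simple arithmetic check using only the percentages reported in the text; the variable names are illustrative.

```python
p = 16  # number of standardized parameters, so the total variance is 16

# Percentages of variance reported in the text for the five retained components.
explained = {"PC1": 27.573, "PC2": 20.797, "PC3": 12.015, "PC4": 7.626, "PC5": 7.068}

eigenvalues = {pc: pct / 100 * p for pc, pct in explained.items()}
cumulative = sum(explained.values())
kaiser_ok = all(value > 1 for value in eigenvalues.values())

for pc, value in eigenvalues.items():
    print(pc, round(value, 3))
print("cumulative percentage:", round(cumulative, 3))
print("all eigenvalues above 1 (Kaiser criterion):", kaiser_ok)
```

The recovered extremes match the reported range of eigenvalues (1.131 to 4.412), the percentages add up to the reported cumulative value, and all five eigenvalues exceed 1, in line with the Kaiser criterion.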

The parameters are then projected onto the different factorial planes. A parameter is well represented on a plane when the end of its projected vector lies close to the unit circle. An extract of the projections of the 16 variables onto the factorial planes (1, 2) and (1, 3) is given in Fig. 5. Three groups of variables can be distinguished. The first group is made up of the parameters DHT, TAC, \(HCO_{3^{-}}\) and \(SO_{4}^{2^{-}}\), which are strongly correlated with the first axis. The second group is composed of the parameters CTH, E. coli and E. faecalis, which have a moderate correlation with the second axis. CTH and E. coli also have a moderate correlation with the third axis. The third group is composed of the variables Cond, T, \(NO_{3^{-}}\) and \(Cl^{-}\), which have a moderate correlation with the second axis. This third group is opposed to the second group on the second axis. According to Fig. 2, TAC, DHT and \(HCO_{3^{-}}\) are strongly correlated. Using this natural property of water, the TAC measurement can be used directly to estimate the DHT and \(HCO_{3^{-}}\) values of the water. This result is not consistent with those of [20, 28] and [34], who have shown that groundwater quality can be accurately predicted solely by measuring electrical conductivity. Moreover, the measurement of E. coli could be sufficient to predict water quality with regard to the parameters CTH and E. faecalis. The correlations obtained between the microbiological parameters studied are similar to those of [3] and [16]. Finally, the correlations obtained among the parameters Cond, \(NO_{3^{-}}\) and \(Cl^{-}\), and among \(SO_{4}^{2^{-}}\), Tur, DHT and TAC, are similar to the results of [23, 26] and [36].

Then, the proximity of the wells is studied in order to determine which wells are similar in terms of physicochemical and microbiological parameters. This makes it possible to form homogeneous groups of wells. The grouping takes into account the coordinates of the wells on the principal components (Table 4) and the CFA method, which is used to study wells and parameters simultaneously in order to highlight correspondences. The eigenvalue extraction method was chosen for this purpose. Based on the proportion of variance explained (Table 5), the number of factors to be retained is two. Consequently, the analysis is limited to the first factorial plane. Figure 6 shows the position of the wells and the parameters studied. Table 6 shows the partial correlations and partial contributions of the physicochemical and microbiological parameters in relation to the factors. The first factor, which accounts for 49.077 percent of the total variance, has a strong positive correlation with Cond (0.884), \(NO_{3^{-}}\) (0.806) and \(Cl^{-}\) (0.755); a moderate correlation with pH (0.614), T (0.676), \(HCO_{3^{-}}\) (0.613) and TAC (0.599); and a weak correlation with DHT (0.483), \(SO_{4}^{2^{-}}\) (0.458), \(NH_{4^{+}}\) (0.388) and \(PO_{4}^{3^{-}}\) (0.314). The parameters Cond, \(HCO_{3^{-}}\), TAC, \(Cl^{-}\), \(NO_{3^{-}}\) and T contribute most to the inertia of this axis. Factor 1 represents the physicochemical component identified in the PCA.

Table 4 Coordinates of the wells projected onto the principal components
Table 5 Eigenvalues and percentage of variance for each principal component in CFA
Table 6 Correlations and contributions between variables and factors
Fig. 6 Projection of the wells and parameters studied on the first factorial plane in CFA

Factor 2 explains 30.7 percent of the total variance and is strongly correlated with E.faecalis (0.896) and moderately correlated with CTH (0.575) and E.coli (0.506). The parameters E.faecalis, CTH and E.coli contribute most to the inertia of this axis. Factor 2 represents the microbiological component identified in the PCA study.
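As a quick check on how much information the retained plane carries, the variance percentages of the two CFA factors can be accumulated. The sketch below uses the two percentages reported in this study (49.077 and 30.717); the helper name is hypothetical.

```python
def cumulative_share(percentages):
    """Running total of the variance percentages, factor by factor."""
    total = 0.0
    running = []
    for p in percentages:
        total += p
        running.append(total)
    return running

# Percentages reported in this study for the two CFA factors.
factor_shares = [49.077, 30.717]
cumulative = cumulative_share(factor_shares)
print([round(c, 3) for c in cumulative])
```

With these figures, the first factorial plane retains about 79.8 percent of the total variance.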

Wells P22, P23, P24, P25, P28, P42, P45, P48, P57, P58, P59, P61, P68, P70 and P71 (Table 4) are well projected onto the first factorial plane because their coordinates on axis 1 are large. These wells take on large values on this axis for the variables DHT, TAC, \(HCO_{3^{-}}\) and \(SO_{4}^{2^{-}}\). Wells P03, P07, P08, P10, P11, P12, P13, P16, P17, P23, P28, P29, P30, P31, P33, P36, P38, P44, P45, P48, P50, P51, P57, P58, P61, P62, P63, P65, P66, P67, P68, P69 and P70 are also well projected on this plane, but it is on axis 2 that their coordinates are large. These wells take on large values on axis 2 for the variables CTH, E.coli and E.faecalis. The wells P10, P12, P13, P16, P26, P55, P56, P61, P63, P65, P66, P67; P38, P45, P55, P58, P61; and P25, P38, P61 are well projected onto PC3, PC4 and PC5 respectively. The data make it possible to characterize them. These wells share relatively high values of certain parameters among all the parameters studied. For example, P25 has the highest pH (11.515); P26 (594 \(\mu S/cm\)) and P61 (545.25 \(\mu S/cm\)) have the highest conductivities.

3.3 One-way ANOVA results

One-way analysis of variance is used to study the effect of the wells on the physicochemical and microbiological parameters. It shows whether the mean of each parameter is the same in the different groups studied. Note that correlated parameters will have similar responses in the ANOVA. Based on the PCA results, the result obtained with the Cond parameter will be similar to that obtained with \(NO_{3^{-}}\) and \(Cl^{-}\). Likewise, the result obtained with the TAC parameter will be the same as that for DHT, \(SO_{4}^{2^{-}}\) and \(HCO_{3^{-}}\). Finally, the result obtained with the parameter E.coli will be similar to that obtained with the parameters CTH and E.faecalis. Consequently, the parameters Cond, E.coli and TAC were selected for testing. The Shapiro-Wilk normality test gave \(p-value<2.2\times 10^{-16}\) for the Cond, TAC and E.coli parameters, which confirms that these parameters are not normally distributed. The non-parametric Kruskal-Wallis test is therefore required for the study. Significant differences between wells were observed for E.coli (\(p-value=0.0071 < 0.05\)), for TAC (\(p-value=1.504\times 10^{-7}< 0.05\)) and for electrical conductivity (\(p< 2.2\times 10^{-16}\)). This indicates that the wells have an effect on these parameters. It also means that the factors influencing the well parameters are different. Duncan’s multiple comparison test carried out on the Cond, TAC and E.coli parameters gave the results summarized in Table 7, Table 8 and Table 9 respectively. These results show significant differences between wells for the parameters studied: sixteen significant levels for Cond (Table 7), nineteen for TAC (Table 8) and eight for E.coli (Table 9).
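To illustrate the rank-based test used here, the following is a minimal pure-Python sketch of the Kruskal-Wallis H statistic with mid-ranks and the usual tie correction. It is not the authors' code (the software used is not stated in this section), and the toy measurements are hypothetical. The p-value is obtained by comparing H with a chi-square distribution with k - 1 degrees of freedom, where k is the number of groups.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with mid-ranks and tie correction.

    groups: list of lists of measurements, one list per well.
    Returns (H, degrees_of_freedom).
    """
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank_of = {}   # mid-rank of each distinct value
    tie_term = 0   # sum of (t^3 - t) over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        # Tied values occupy ranks i+1 .. j; they all get the average rank.
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        tie_term += t ** 3 - t
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3.0 * (n + 1)
    correction = 1.0 - tie_term / float(n ** 3 - n)
    if correction == 0.0:  # every value identical: no evidence of a difference
        return 0.0, len(groups) - 1
    return h / correction, len(groups) - 1

# Hypothetical repeated measurements for three wells.
toy_wells = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_stat, dof = kruskal_wallis_h(toy_wells)
print(h_stat, dof)
```

For two degrees of freedom, the 5 percent critical value of the chi-square distribution is about 5.99, so an H statistic above that value indicates a significant difference between the groups.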
With regard to the parameters studied, if three wells belong to different significant levels, for example a, ab and b, the well at level a is close to the well at level ab but different from the well at level b. Similarly, the well at level b is close to the well at level ab but different from the well at level a, and so on. In other words, although the wells are in the same study area, the parameters of their water evolve differently. The advantage of this classification is as follows: if all seventy-two wells are to be treated for E.coli, they must be grouped into eight sub-groups, and each sub-group must be treated differently depending on its concentration level.
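The reading rule for the letter levels can be written as a one-line check: two wells are not significantly different when their levels share at least one letter. The sketch below is only an illustration; the well names and levels are hypothetical.

```python
def not_significantly_different(level_a, level_b):
    """True when two wells share at least one grouping letter."""
    return bool(set(level_a) & set(level_b))

# Hypothetical wells and levels, for illustration only.
levels = {"well_1": "a", "well_2": "ab", "well_3": "b"}

print(not_significantly_different(levels["well_1"], levels["well_2"]))
print(not_significantly_different(levels["well_1"], levels["well_3"]))
```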

Table 7 Wells classified by concentration levels of the conductivity parameter
Table 8 Wells classified by concentration levels of the total alkalinity (TAC) parameter
Table 9 Wells classified by concentration levels of the Escherichia coli parameter

3.4 Classification of well water samples by SOM

The concept of the SOM algorithm is to perform a nonlinear classification of complex data sets by recognizing similar patterns. In this work, the input layer consists of vectors representing the seventy-two (72) wells, each of which contains sixteen (16) components representing the 16 physicochemical and microbiological parameters of the well water studied. The output layer is composed of 16 neurons (4 rows\(\times\)4 columns). This size was chosen for the output map after convergence of the algorithm. Figure 7 shows the role of the parameters in defining the different areas of the topological map, and Table 10 lists the wells in each node. With the exception of the south-western part, almost the entire map is characterized by the parameters in green and yellow (Tur, Cond, pH, T, \(NO_{3^{-}}\), \(NO_{2^{-}}\), \(NH_{4^{+}}\), \(PO_{4}^{3^{-}}\), \(Cl^{-}\), TAC and DHT). The south-western part is characterized by the variables in pink (E.coli, CTH and E.faecalis). A graph of each variable (Fig. 8) is produced to show the correlations between them; this graph can help to summarize the effective parameters of the wells in each node. In the SOM component planes of the data set, two types of colors can be distinguished: dark red cells represent high values, while blue cells represent low values of each parameter [27]. Similar color patterns between variables indicate a positive correlation. This can be seen for the variables \(HCO_{3^{-}}\), DHT, \(SO_{4}^{2^{-}}\) and TAC. There is also a positive correlation between these parameters and Tur. Cond and \(NO_{3^{-}}\) are positively correlated. There is also a positive correlation between the E.coli, E.faecalis and CTH variables. These results confirm those obtained previously. On the other hand, T, \(PO_{4}^{3^{-}}\), \(Cl^{-}\), \(NO_{2^{-}}\), \(NH_{4^{+}}\) and pH vary independently of each other. The same idea can be expressed by using a dispersion indicator such as variance.
The variance weighted by the number of nodes is calculated, which makes it possible to rank the parameters by their role. The important variables (those that induce the strongest contrasts) appear in first position (Table 11). The parameters \(PO_{4}^{3^{-}}\) and \(NH_{4^{+}}\) are the least influential, which means that their conditional averages tend to be homogeneous across the map. These results confirm what was seen in the various graphs. A detailed summary of the parameters for each well is presented in Fig. 8. The dark red nodes represent high values of each parameter.
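For readers unfamiliar with Kohonen maps, the following is a minimal sketch of SOM training on a 4 x 4 grid, written in plain Python. It is not the implementation used in this study: the learning-rate and neighbourhood schedules, the number of epochs, the random initialization and the toy data are all assumptions made for illustration. Each input vector is assigned to its best matching unit (the neuron with the closest weight vector), and that neuron and its grid neighbours are moved towards the input.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )

def train_som(data, rows=4, cols=4, epochs=200, lr0=0.5, seed=0):
    """Train a rows x cols SOM on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs           # goes from 1 towards 0
        lr = lr0 * decay                       # learning rate (assumed linear decay)
        sigma = 0.5 + sigma0 * decay           # neighbourhood radius on the grid
        for x in rng.sample(data, len(data)):  # shuffled pass over the inputs
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights

# Hypothetical, already-scaled toy data: two contrasting groups of "wells".
toy = [[0.05, 0.10, 0.00], [0.10, 0.05, 0.10], [0.00, 0.10, 0.05],
       [0.95, 0.90, 1.00], [0.90, 1.00, 0.95], [1.00, 0.95, 0.90]]
som = train_som(toy)
node_low = best_matching_unit(som, toy[0])
node_high = best_matching_unit(som, toy[3])
print(node_low, node_high)
```

In practice the sixteen parameters would be standardized before training, since they are measured on very different scales.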

Wells P01, P03, P05, P06, P07, P08, P10, P11, P12, P13, P15, P16, P17, P18, P21, P22, P23, P24, P25, P28, P29, P30, P37, P42, P44, P45, P47, P48, P52, P56, P57, P58, P59, P62, P65, P67, P68, P69, P70, P71 and P72 are mainly characterized by high turbidity (13.9 NTU, 162.43 NTU). Wells P08, P23, P24, P26, P27, P28, P31, P34, P35, P38, P40, P42, P43, P45, P52, P55, P56, P59, P60, P61, P62, P63, P64, P66, P68 and P71 are mainly characterized by high conductivity (172.875 \(\mu\)S/cm, 594 \(\mu\)S/cm). Wells P01, P03, P05, P08, P22, P23, P24, P25, P28, P29, P30, P39, P42, P44, P45, P46, P59, P61, P68, P69, P70, P71 and P72 are mainly characterized by high pH (5.24, 11.51). Wells P15, P22, P26, P27, P35, P36, P37, P39, P49, P58, P60, P62, P63, P64, P65, P66, P68, P69 and P72 are mainly characterized by high T (28.425 \(^{\circ }\)C, 29.08 \(^{\circ }\)C). Wells P23, P26, P42, P43, P55 and P66 are mainly characterized by high \(NO_{3^{-}}\) concentrations (27.018 mg/l, 44.16 mg/l). Wells P25, P26, P28, P38, P42, P45, P55, P58, P59, P61, P68, P70 and P71 are mainly characterized by high \(NO_{2^{-}}\) concentrations (0.11 mg/l, 0.54 mg/l). Wells P38, P61 and P72 are mainly characterized by high \(NH_{4^{+}}\) concentrations (1.1225 mg/l, 3.82 mg/l). Wells P01, P02, P03, P04, P05, P06, P07, P08, P10, P11, P12, P13, P14, P15, P16, P17, P18, P19, P20, P21, P22, P23, P24, P26, P27, P28, P29, P30, P31, P32, P35, P36, P37, P38, P39, P40, P41, P42, P43, P44, P45, P47, P48, P49, P50, P53, P54, P55, P56, P57, P58, P59, P60, P61, P62, P63, P64, P65, P66, P67, P68, P69, P70, P71 and P72 are mainly characterized by high \(PO_{4}^{3^{-}}\) concentrations (0.0375 mg/l, 0.75 mg/l). Wells P27, P29, P34, P40, P42, P49, P61, P63, P66 and P68 are mainly characterized by high \(Cl^{-}\) concentrations (27.975 mg/l, 44.7 mg/l). Wells P24, P45, P58, P68, P70 and P71 are mainly characterized by high TAC (147.5 mg/l, 216.25 mg/l).
Wells P45, P58 and P68 are mainly characterized by high DHT concentrations (86.25 mg/l, 106.25 mg/l). Wells P24, P25, P28, P51, P59, P68 and P71 are mainly characterized by high \(SO_{4}^{2^{-}}\) concentrations (21.25 mg/l, 40.5 mg/l). Wells P12, P23, P24, P28, P30, P31, P39, P40, P42, P44, P45, P48, P49, P56, P57, P58, P59, P62, P68, P70, P71 and P72 are mainly characterized by high \(HCO_{3^{-}}\) concentrations (93.75 mg/l, 216.25 mg/l). Wells P10, P12, P13 and P16 are mainly characterized by high CTH concentrations (4367.5 UFC/250 ml, 7400 UFC/250 ml). Wells P10, P12, P13 and P16 are mainly characterized by high E.coli concentrations (3868 UFC/250 ml, 5650 UFC/250 ml). Wells P01, P02, P03, P04, P05, P06, P07, P08, P09, P10, P11, P12, P13, P14, P15, P16, P17, P19, P21, P22, P23, P24, P25, P27, P28, P29, P30, P31, P32, P33, P34, P36, P37, P38, P39, P40, P41, P42, P43, P44, P45, P46, P47, P48, P49, P50, P51, P52, P53, P56, P57, P58, P59, P62, P63, P64, P65, P66, P67, P68, P69, P70, P71 and P72 are mainly characterized by high E.faecalis concentrations (84.5 UFC/250 ml, 3913 UFC/250 ml).

Fig. 7 Graph of the nature of the different zones on the map in relation to the parameters

Table 10 Distribution of wells in each node
Fig. 8 Graph of the nature of the different zones on the map in relation to each parameter

Table 11 Relevance of parameters

Once the Kohonen map has been obtained, HCA is used to group the seventy-two wells based on the similarity of their responses to the physicochemical and microbiological parameters. Ward’s method and the complete method gave better results than the other methods. Ward’s method (Fig. 9) gives three clusters, as in [8]. The complete method (Fig. 10) gives five clusters. Cluster 1 of the complete method consists of well P61 alone, the only well that is well projected on all the factorial planes of the PCA (Table 4). Cluster 2 of the complete method is cluster 2 of Ward’s method. Clusters 3 and 4 of the complete method together form Ward’s cluster 1. Finally, cluster 5 corresponds to Ward’s cluster 3. The results obtained using Ward’s method will therefore be used for the analysis. The clusters obtained are very close to those obtained by PCA.
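The agglomerative principle behind HCA can be sketched in a few lines. The example below implements complete linkage only, in a naive form: at each step, the two clusters whose farthest members are closest are merged. Ward’s method follows the same loop but merges the pair that least increases the within-cluster variance. The function name and the toy points are hypothetical, and this is not the code used to produce Figs. 9 and 10.

```python
def complete_linkage(points, n_clusters):
    """Agglomerative clustering with complete linkage (naive version).

    points: list of numeric vectors; returns a list of clusters, each a
    sorted list of indices into `points`.
    """
    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)

# Hypothetical one-dimensional toy data with three obvious groups.
toy_points = [[0.0], [0.2], [5.0], [5.1], [9.0]]
three = complete_linkage(toy_points, 3)
two = complete_linkage(toy_points, 2)
print(three, two)
```

Cutting the same hierarchy at different heights gives different numbers of clusters, which is how one method can yield three clusters and another five on the same wells.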

Fig. 9 Dendrogram of wells from M’pody village obtained using the Ward method

Fig. 10 Dendrogram of wells from M’pody village obtained using the Complete method

Cluster 1 contains seven wells (P24, P28, P45, P58, P68, P70 and P71) and represents 9.72 percent of the total number of wells. These wells have the largest coordinates on PC1. The cluster is mainly characterized by high concentrations of microbiological elements (CTH [362 UFC/250 ml, 2365 UFC/250 ml]; E.coli [180 UFC/250 ml, 2065 UFC/250 ml]; E.faecalis [716 UFC/250 ml, 3542.5 UFC/250 ml]). These waters are very turbid [13.54 NTU, 162.43 NTU] and acidic [5.09, 6.48], with conductivity [62.6 \(\mu\)S/cm, 430.75 \(\mu\)S/cm] and a temperature [27.875 \(^{\circ }\)C, 28.525 \(^{\circ }\)C] higher than the WHO standard. They comply with the WHO standards for the parameters \(NO_{3^{-}}\), \(NO_{2^{-}}\), \(NH_{4^{+}}\) (except P68, 1.02 mg/l), \(PO_{4}^{3^{-}}\) (except P45, 0.7525 mg/l), \(Cl^{-}\), TAC, DHT, \(SO_{4}^{2^{-}}\) and \(HCO_{3^{-}}\).

Cluster 2 contains four wells (P10, P12, P13, P16) and represents 5.56 percent of the total number of wells. These wells have the largest coordinates on PC2. The cluster is mainly characterized by very high concentrations of microbiological elements (CTH [4367.5 UFC/250 ml, 7400 UFC/250 ml]; E.coli [3867.5 UFC/250 ml, 5650 UFC/250 ml]; E.faecalis [84.5 UFC/250 ml, 3913 UFC/250 ml]). These waters are turbid [15.7 NTU, 20.7475 NTU] and acidic [4.625, 5.19], with very low conductivity [24.725 \(\mu\)S/cm, 67.025 \(\mu\)S/cm] and a temperature [26.1 \(^{\circ }\)C, 26.3 \(^{\circ }\)C] higher than the WHO standard. They comply with the WHO standards for the parameters \(NO_{3^{-}}\), \(NO_{2^{-}}\), \(NH_{4^{+}}\), \(PO_{4}^{3^{-}}\), \(Cl^{-}\), TAC, DHT, \(SO_{4}^{2^{-}}\) and \(HCO_{3^{-}}\). This cluster represents the microbiological component (F2) previously described in the PCA/CFA study.

Cluster 3 includes the largest number of wells (sixty-one) and represents 84.72 percent of the total number of wells. It is characterized by highly variable concentrations of microbiological elements (CTH [4.25 UFC/250 ml, 2730 UFC/250 ml]; E.coli [2 UFC/250 ml, 2329.5 UFC/250 ml]; E.faecalis [3.5 UFC/250 ml, 3781.25 UFC/250 ml]). These waters are turbid [2.705 NTU, 83.775 NTU] and more acidic [4.22, 6.065], with conductivity [25.8525 \(\mu\)S/cm, 440.95 \(\mu\)S/cm] and a temperature [25.875 \(^{\circ }\)C, 28.675 \(^{\circ }\)C] higher than the WHO standard. They comply with the WHO standards for the parameters \(NO_{3^{-}}\), \(NO_{2^{-}}\), \(PO_{4}^{3^{-}}\) (except P55, 0.5425 mg/l), \(Cl^{-}\), TAC, DHT, \(SO_{4}^{2^{-}}\) and \(HCO_{3^{-}}\). With regard to ammonium, 18.03 percent of the wells in this cluster do not comply with WHO standards. These are wells P31 (0.76 mg/l), P35 (0.565 mg/l), P38 (1.1225 mg/l), P40 (0.675 mg/l), P52 (0.542 mg/l), P55 (0.5525 mg/l), P57 (0.6375 mg/l), P61 (3.8175 mg/l), P62 (0.64 mg/l), P63 (0.52 mg/l) and P72 (1.215 mg/l). Cluster 1 and cluster 3 represent the physicochemical component (F1) presented in the PCA/CFA study.
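The percentages quoted for the three Ward clusters can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text (7, 4 and 61 wells out of 72, and the eleven wells of cluster 3 listed as exceeding the ammonium standard); the helper name is hypothetical.

```python
total_wells = 72
cluster_sizes = {"cluster 1": 7, "cluster 2": 4, "cluster 3": 61}

def percent(part, whole):
    """Share of `part` in `whole`, in percent, rounded to two decimals."""
    return round(100.0 * part / whole, 2)

shares = {name: percent(size, total_wells) for name, size in cluster_sizes.items()}
# Eleven wells of cluster 3 are listed as exceeding the ammonium standard.
ammonium_share = percent(11, cluster_sizes["cluster 3"])
print(shares, ammonium_share)
```

The three cluster sizes add up to the seventy-two wells, and the 18.03 percent figure is relative to the sixty-one wells of cluster 3, not to all wells.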

4 Conclusion

A multivariate statistical approach was applied to a database comprising sixteen (16) physicochemical and microbiological parameters measured on two hundred and eighty-eight (288) well water samples from the village of M’pody. This approach is very promising because it makes it possible to understand water quality while highlighting the correlations that exist between the parameters studied. The study showed that turbidity, conductivity, pH and temperature did not meet WHO standards. In addition, the water from all the wells is polluted with faecal bacteria (E.coli, E.faecalis and CTH). It is certainly this faecal pollution that is at the root of the diarrhoea epidemic. It indicates that poor well maintenance is the main factor controlling microbiological pollution of well water in the study area. This situation could be explained, on the one hand, by infiltration from septic tanks located near the wells and, on the other hand, by the run-off of waste water carrying human and animal faecal matter. To prevent epidemics, populations who use well water or surface water should use approved technicians for the construction of latrines, erect protective perimeters around water points and learn water treatment techniques, for example filtering water through layers of granular materials or through granular activated carbon. However, all the physicochemical parameters \(NO_{3^{-}}\), \(NO_{2^{-}}\), \(NH_{4^{+}}\), \(PO_{4}^{3^{-}}\), \(Cl^{-}\), TAC, DHT, \(SO_{4}^{2^{-}}\) and \(HCO_{3^{-}}\) comply with WHO standards. The ANOVA method showed that there were significant differences between the wells for the parameters TAC, Cond and E.coli, due to the specificity and characteristics of each well. In addition, the ANOVA confirmed that human activities were the main factors influencing the physicochemical and microbiological parameters of the wells studied.
PCA, CFA and SOM are the multivariate analysis techniques used to highlight certain specificities in the structure of the data. Five principal components identified by PCA accounted for 75.079 percent of the total variance. The PCA, CFA and HCA identified the structure of the wells and made it possible to deduce the main factors controlling the physicochemical and microbiological parameters of the water in these wells. With regard to the CFA, two main factors were identified. The first factor, with a 49.077 percent contribution, was identified as the physicochemical component, and the second, with a 30.717 percent contribution, was related to the microbial load. The physicochemical component is mainly formed by the parameters Cond, \(HCO_{3^{-}}\), TAC, \(Cl^{-}\), \(NO_{3^{-}}\), pH and T, and the microbial component is linked to E.coli, E.faecalis and CTH. The results of the PCA/CFA are broadly similar to those obtained by applying the Ward and complete methods. However, there are some differences, due to the specificity of each method. This study also showed that measuring DHT is sufficient to predict water quality in terms of TAC, \(HCO_{3^{-}}\) and \(SO_{4}^{2^{-}}\). Similarly, the measurement of E.coli could be sufficient to predict water quality with regard to the parameters CTH and E.faecalis. Moreover, major difficulties are often encountered when using traditional PCA/CFA methods. An individual who is poorly represented but whose contribution is significant is eliminated from the analysis (extra individual). On the other hand, there are individuals whose contribution is too large and whose reliability is called into question. In this case, a new study is carried out. To overcome these difficulties, we plan to replace traditional PCA/CFA distances with robust distances such as the Hellinger distance.
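For reference, the Hellinger distance between two discrete profiles can be computed as in the sketch below. How it would be integrated into the PCA/CFA is not specified here, so the function and the example profiles are purely illustrative assumptions.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions p and q.

    p, q: equal-length sequences of non-negative numbers summing to 1.
    The result lies between 0 (identical) and 1 (disjoint supports).
    """
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0
    )

# Hypothetical profiles, for illustration only.
print(hellinger([0.5, 0.5], [0.5, 0.5]))
print(hellinger([1.0, 0.0], [0.0, 1.0]))
```

Because it works on square roots of the proportions and is bounded by 1, a single very large value weighs less than it would with a squared Euclidean distance on the raw proportions.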
Subsequently, in order to implement effective planning and support methods for sustainable well water management, multiple linear regression and multi-layer perceptron models can be used to predict the dependent parameters. In practical terms, E.coli can be predicted from CTH and E.faecalis; TAC from DHT, \(SO_{4}^{2^{-}}\), and \(HCO_{3^{-}}\); Cond from T, \(NO_{3^{-}}\) and \(Cl^{-}\).