Combination of geostatistics and self-organizing maps for the spatial analysis of groundwater level variations in complex hydrogeological systems

Successful modelling of the groundwater level variations in hydrogeological systems in complex formations considerably depends on spatial and temporal data availability and knowledge of the boundary conditions. Geostatistics plays an important role in model-related data analysis and preparation, but has specific limitations when the aquifer system is inhomogeneous. This study combines geostatistics with machine learning approaches to solve problems in complex aquifer systems. Herein, the emphasis is given to cases where the available dataset is large and randomly distributed in the different aquifer types of the hydrogeological system. Self-Organizing Maps can be applied to identify locally similar input data, to substitute the usually uncertain correlation length of the variogram model that estimates the correlated neighborhood, and then by means of Transgaussian Kriging to estimate the bias corrected spatial distribution of groundwater level. The proposed methodology was tested on a large dataset of groundwater level data in a complex hydrogeological area. The obtained results have shown a significant improvement compared to the ones obtained by classical geostatistical approaches.


Introduction
Water scarcity is a major global problem, and it is expected to increase significantly in the near future. Projections for the Mediterranean region indicate a gradual decline in runoff and significant changes in surface and groundwater availability (García-Ruiz et al. 2011;Hartmann et al. 2014). The importance of groundwater in mitigating and buffering the problems associated with water scarcity in the Mediterranean region has been clearly recognized, especially in light of the increasing impacts of climate change (Jomaa Seifeddine et al. 2021; Smerdon 2017; Thomas and Famiglietti 2019). The development of adequate modeling tools is recommended in informing precautionary water resources management plans and allowing adaptation to the projected changes (Garrote et al. 2015;Wada et al. 2010).
Large scale groundwater modelling is complex due to the various boundary conditions and the hydrogeological inhomogeneities that usually exist. Geostatistical methods have often been used as a surrogate approach to model the groundwater level spatial variability. However, in complex groundwater systems with inhomogeneous aquifers, geostatistical analysis of monitored groundwater levels is challenging because stationarity is not guaranteed, and spatial dependence of observations cannot be reliably estimated. Furthermore, in the case of areas with rich monitoring networks, the availability of a large dataset of observations requires an efficient preprocessing algorithm to organize the dataset based on its characteristics.
In this work, a geostatistical analysis approach by means of Ordinary Kriging (OK) in combination with a machine learning algorithm, Self-Organizing Maps (SOM), was applied (a) to obtain stationary groundwater level data for producing a more stable variogram that reaches a plateau (Goovaerts 1997), leading to efficient estimation of observations spatial dependence (b) to develop reliable spatial maps of groundwater level variability in a largescale hydrogeological system of complex aquifers, and (c) to process a large dataset of 2524 wells efficiently using geostatistical analysis. The proposed approach represents an innovative approach in geostatistics by using the original SOM technique to classify similar neighboring data to create stationary neighborhoods and perform geostatistical analysis employing the OK method. OK and SOM are combined into a GeostatSOM hybrid approach.
Geostatistics has been successfully applied in groundwater-related studies (Cushman and Tartakovsky 2016;Kitanidis 1997;Panagiotou et al. 2022), and in combination with data science techniques to improve groundwater level spatial variability estimation (mapping). An indicative list of such combined research works follows: Hoogland et al. (2010) applied the multiple linear regression technique in terms of areal data to interpolate by means of Regression Kriging (RK) groundwater depths with improved accuracy, Manzione and Castrignanò (2019) employed multivariate data fusion methodologies along with OK to predict groundwater levels, Theodoridou et al. (2017) perform spatial analysis of groundwater levels using OK employing a Fuzzy Logic approach, Varouchakis and Hristopulos (2019) applied an exponential weighted moving average process to detrend groundwater level observations and improve its spatial analysis utilizing RK, and  interpolated groundwater levels using a Bayesian kriging approach for advanced accuracy.
Several studies have shown that the SOM technique can be successfully applied to water resources topics e.g., in clustering and classifying large datasets of precipitation, runoff and drought to support watershed and large-scale hydrologic modelling approaches (Kalteh et al. 2008;Markonis et al. 2018), organising large collections of hydrometeorological data for streamflow and precipitation prediction (Hsu and Li 2010;Toth 2009) and rainfallrunoff modelling using satellite products by classifying satellite imagery data (Farzad and El-Shafie 2017;Nourani et al. 2013).
SOM can be successfully combined with geostatistics in spatial analysis applications to classify the properties of large datasets (Kanevski 2013); it has been successfully applied in soil science to express the nonlinear relationships between soil organic matter and correlated factors in terms of regression kriging (Huang et al. 2017). However, to our knowledge, this is the first time such a hybrid approach has been applied to groundwater level spatial analysis. A similar methodological approach called Geo-SOM (Henriques et al. 2012) considers the coordinates of available observations to perform data clustering only.
The island of Crete in Greece is used as a case study due to its rich groundwater level monitoring network and complex hydrogeological system to verify the proposed methodology. The island's groundwater resources have been significantly affected by overexploitation and are expected to be further affected by the climate change projections for the Eastern Mediterranean (Founda et al. 2019;Kim et al. 2019;Koutroulis et al. 2013; Special water secretariat of Greece 2017).
This work could be a guide for analogous applications in areas with complex hydrogeologic systems and large data sets where physically based models require a significant amount of information and conventional geostatistical approaches are not directly applicable. Thus, scientists and policymakers can apply it to determine and understand the state of the aquifer and its variability in complex systems identifying areas where groundwater management interventions are needed.
Apart from the introduction section, where background information on the proposed methodologies and its relativeness to the topic are presented, the remainder of the paper is formed in three sections. In materials and methods section the case study and data availability is discussed, and the mathematical background of the proposed methods and its implementation is presented. The results and discussion section follows, where the proposed research findings are described and discussed, and finally, in the last section the concluding remarks are highlighted.

Study area and data availability
The island of Crete is located in the southeastern part of the Mediterranean basin (Fig. 1). Hydrogeologically, it consists of a mixed sequence of porous and karstic aquifers. The most important hydrogeologic systems in terms of capacity and supply capabilities are the karst aquifers formed in carbonate rocks. These are followed by alluvial aquifers which are generally intensively exploited due to their relatively shallow depth. Finally, low groundwater capacities are found in rocks such as phyllites, schists, ophiolites which are important for water consumption as they meet the water needs of high-altitude settlements. The hydrogeological units of Crete have an area of 6.303 km 2 compared to the total area of Crete which is 8.335 km 2 . The remaining area of about 2000 km 2 relates mainly to outcrops of phyllite-quartzite and flysch and small occurrences of carbonate and granular formations (Neogene-Quaternary) which are not related to the main hydrogeological units of the water district (Special water secretariat of Greece 2017).
In the 1980s, an extensive network of pumping stations was installed in Crete and dry cultivation was converted to drip-irrigated. As a result, productivity has increased, but this overuse has led to a dramatic decline in groundwater levels in many aquifers over the past thirty years. In the water district of Crete, although a number of reservoirs have been built recently, the main source of irrigation has always been groundwater (drilling of natural sources) In the most recent Water Resources Management Plan prepared under the Water Framework Directive for the Crete Water District, the quantitative and qualitative status of the island's groundwater bodies was assessed in addition to the surface water catchments. Out of the 91 groundwater bodies, 9 are of poor-quality, and 10 are significantly depleted. Due to their qualitative and quantitative characteristics, 11 groundwater bodies require special attention (Special water secretariat of Greece 2017). In addition, two recent studies confirmed the findings by assessing the recharge potential of these groundwater bodies (Matiatos and Wassenaar 2019;Varouchakis et al. 2018).
In this work, the monitored biannual average data of 2524 wells (Special water secretariat of Greece 2017, 2020) are processed for the hydrological year 2017/ 18 (Table 1), using the GeostatSOM approach to estimate the spatial variability of the water table in a complex hydrogeological system of aquifers (Fig. 1).

Spatial statistics
The classical interpolation method Ordinary Kriging (OK) (Cressie 1990;Krige 1951;Matheron 1971) was used in this work to combine SOM with geostatistics. Kriging provides an estimate of the variable and the error variance of the corresponding estimate (the associated uncertainty). In ordinary kriging, the error variance (1) depends on the variogram model (spatial interdependence); the estimation accuracy depends on the complexity of the spatial variability of the random field as modeled by the variogram, (2) depends on the data configuration and its distances from the location being estimated, (3) is independent of the data values; for a given variogram model, two identical data configurations give the same variance regardless of their values and (4) the error variance is zero at the data locations and increases in distance from the data while it reaches a maximum value in the extrapolation situation (Goovaerts 1997;Kitanidis 1997).
However, for highly skewed data, a Gaussian transformation approach is required for Kriging error variance to be reliable as it is possible to create confidence intervals (Clark and Harper 2000;Dowd 2018). Kriging maps variability only up to the second order moment (covariance); therefore, the random field of the transformed variables must be Gaussian to derive unbiased estimates at unsampled locations (Deutsch and Journel 1992;Kitanidis 1997;Varouchakis 2021). OK is known to be optimal when the data have a multivariate normal distribution, and the true variogram is known. Therefore, data transformation is required before kriging to normalize the data distribution, suppress outliers, and thus improve variogram estimation (Armstrong 1998;Deutsch and Journel 1992). Estimation is performed in the Gaussian domain; before the estimates are transformed back to the original domain. However, the back transformation of the data may lead to biased results (Belitz and Stackelberg 2021; Schabenberger and Gotway 2005). Therefore, in this work the Transgaussian Kriging method in terms of the OK predictor is applied that corrects the retransformation bias (Cressie 1993). Transgaussian Kriging (TGK) is recommended for geostatistical analysis of data that require a Gaussian transformation (Schabenberger and Gotway 2005) and has been used successfully with groundwater data (Varouchakis et al. 2012). It employs the Box-Cox transformation method. For a given data set of measurement values Z at corresponding measurement locations s, ZðsÞ, a nonlinear normalizing transformation is expressed as where YðsÞ follows the multivariate Gaussian distribution. Implicitly,ZðsÞ ¼ uðYðsÞÞ, where uðÁÞ ¼ g À1 ðÁÞ is a one-to-one, twice-differentiable function. It is also assumed that YðsÞ is an intrinsically stationary random field with mean m Y and variogram c Y r ð Þ: For an unknown m Y , the predictor OK,Ŷ OK ðs 0 Þ, is used to predict Yðs 0 Þ. An estimate of Zðs 0 Þ is then given bŷ where uðÁÞ is the inverse of the transformation function. However, this leads to a biased predictor if uðÁÞ is a nonlinear transformation. A biascorrecting approximation is the trans-Gaussian predictor (Cressie 1993): wherem Y is the OK estimate of m Y , l Y is the Lagrange multiplier of the OK system, u 00 ðÁÞ is the second-order derivative of the inverse transformation function, and r 2 OK;Y ðs 0 Þ is the OK variance. When the Box-Cox normalizing transformation is used, as here, the functions uðÁÞ and u 00 ðÁÞ have the following form (Cressie 1993 where k is the power exponent of the Box-Cox method (Box and Cox 1964). The Spartan variogram function was used because it has been successfully applied in spatial and space-time geostatistical data analysis of groundwater level data (Ruybal et al. 2019;Hristopulos 2013, 2019). To evaluate the efficiency of the proposed method it was compared with the well-known Regression Kriging method in terms of auxiliary variables such as coordinates and elevation, Universal Kriging and Ordinary Kriging.
Statistical metrics such as the Mean Absolute Relative Error, the Mean Absolute Error and the Correlation Coefficient (R) were employed to evaluate the comparison of the methods.

Self-organizing maps
Self-organizing maps (or Kohonen networks) are an unsupervised machine learning technique (Kohonen 2001b). It is used for dimensionality reduction of data, clustering of complex datasets, local data similarity detection and pattern identification (Bowden et al. 2005;Hsu and Li 2010;Kohonen 2013;Richardson et al. 2003). SOM transforms complex statistically based relationships of high-dimensional data into simpler spatial or temporal relationships that can be represented as an easily interpretable map of a lower dimension. In a sense, SOM can be viewed as a clustering technique, that leads to the identification of data groupings (clusters), i.e., measurements (vectors) in the training set. However, it is different from the other clustering methods, which result in assigning each measurement location to one of the clusters, as is the case, for example, with the widely used k-means clustering algorithm. SOM yields a set of nodes (typically arranged in the form of a rectangular or hexagonal two-dimensional grid) whose positions (in terms of Euclidean distance) would be close to groups of measurement values. A trained SOM can then be used as a classifier: it classifies a vector from the input space by finding the node (called the best matching unit, BMU) that is closest to that input vector. Initially, the SOM nodes, forming a grid, are randomly located and gradually move towards the groups of data measurements during the iterative training process.
The algorithm below (Kohonen 2001a) presents the basic steps of training the SOM network consisting of a certain number of output nodes. A data set ZðsÞ of n locations, s 1 ; . . .s n f gin k-dimensional space, is considered: Self-organizing maps learn to classify input vectors according to how they are grouped in the input space. They learn to recognize neighboring sections of the input space. Thus, self-organizing maps learn both the distribution and topology of the input vectors they are trained on. The nodes in the layer of a SOM are arranged originally in physical positions according to a topology function, such as a grid or hexagonal in such a way that topological properties in the input training data are preserved (Augustijn and Zurita-Milla 2013;Kohonen 2013). We used the Kohonen R package (Wehrens and Buydens 2007) to train several SOMs following these steps: 1. The size (number of nodes, including number of rows and columns) and type (rectangular) of the SOM lattice were chosen. 2. Each node was assigned a random vector of weights (m k ) with the same dimensionality as the input data. 3. Random data samples (input vector) were iteratively presented to the low-dimensional lattice to identify the best matching unit, BMU, which is the node that minimizes the Euclidean distance between the weight vector and the data sample at hand. This iterative process is known as training the SOM and each iteration (t) is used to update the weight vector of the BMU and the neighbouring nodes according to: in which m k is a k-dimensional weights vector, a(t) is the learning rate and x 2 Z k ðsÞ is a randomly chosen input vector from the training dataset (Augustijn and Zurita-Milla 2013; Kohonen 2013; Wehrens and Buydens 2007). The aforementioned methodological steps are summarized in a flowchart (Fig. 2) Thus, when a vector x is presented, the weights of the winning node and its close neighbors move towards x. The winning node is rewarded with becoming more like the sample vector. Consequently, after many iterations, neighboring nodes have learned vectors similar to each other. For SOM training, the weight vector associated with each node moves to become the center of a cluster of input vectors. After training the SOM, the next step in the SOM process is mapping the data onto the trained SOM network, identifying for each node location in the topology how many of the observations are associated with each node (cluster centers). This process is repeated for each input vector for a (usually large) number of cycles. The network winds up associating nodes with the input data set. This is delivered by the SOM Sample Hits plot (Kohonen 2013;Wehrens and Buydens 2007).
The quality and accuracy of SOM results can be determined by two performance indices, quantization error, and topographic error. The quantization error is the average distance between the data vector and the best matching unit (BMU) (the closest node). The quantization error indicates how accurately the given input patterns are represented by a BMU. The smaller this index, the better the representation. The topographic error is related to the quality of the map topology: the quality is high (and thus the error low) if the nodes adjacent on the grid correspond to similar input patterns (Beale and Jackson 1990). It is calculated as the ratio of all vectors, where the first and second best BMUs are not adjacent on the grid.

Implementation and analysis
In this study the two-dimensional rectangular SOM was used. The SOM method was implemented in R using the related package (Wehrens and Buydens 2007) by the authors, while the code originally developed in R which combines Transgaussian kriging with SOM (TGK-SOM), and forms the GeostatSOM model, was developed using the geoR package (Ribeiro et al. 2007). This package was also used to perform Regression Kriging, Universal Kriging and Ordinary Kriging. The cross-validation measures used to compare the true and estimated groundwater level values using the different methods are: Mean absolute error (MAE): Fig. 2 Flowchart of the unsupervised SOM algorithm (Kohonen 2001a) based on Eq. (4); m k is a k-dimensional weights vector, x is an input vector from the dataset, BMU is the Best Matching Unit, t the iterations and t max the maximum iterations number Mean absolute relative error (MARE): Linear Correlation Coefficient: where z 2 Z; z Ã ðs i Þ is the estimated level at location s i , zðs i Þ the observed value, zðs i Þ denotes the spatial average of the data and z Ã ðs i Þ the spatial average value of the estimates. The SOM method was used in this work to help identify the kriging estimation neighborhood. As is well known, the correlation length parameter calculated from the variogram estimation is used to define the estimation neighborhood. However, in many cases the theoretical variogram model calculation is the best possible without perfectly modeling the experimental variogram, which is based on the measurements and usually does not have a clear sill, which is an important factor in determining a representative correlation length. This is usually the case when long-distance spatial variations exist, and measurement values with high variability lead to significant fluctuations in variogram values. Another reason for the experimental variogram not reaching a sill is the presence of significant trends. Therefore, the calculation of the correlation length is uncertain.
To obtain an appropriate correlation length, various methods are used, such as detrending the data or cross validation using variable correlation lengths based on the rule that a tolerance can be applied. In addition, simulation methods such as sequential Gaussian simulation are also applied, where different realizations are developed that also take into account the uncertainty of the variogram parameters. At the same time, the variability map of the measurements usually depends on the mean of the simulations.
Thus, in this work, SOM was applied to identify the locally similar input data measurements by using the associate Kohenen network node associated with that local neighborhood. The measurement locations belonging to the radius of influence of a node have as their estimation neighborhood the measurements that are associated with that node.
Another reason for using SOM to identify the estimation neighborhood is the nature of the case study considered, the island of Crete. The spatial variability of the groundwater level is studied in a large-scale complex hydrogeological system of porous and karst type aquifers. The physically based modelling of the whole system requires prior knowledge of all physical and anthropogenic boundaries and factors affecting the groundwater level variations. On the other hand, a geostatistical approach may seem more convective as a surrogate, since only the spatial/spatiotemporal dependence needs to be determined. However, in a large-scale application with the presence of a sequence of different aquifer types the obtained correlation length may not be representative of the aquifer type. Our experience shows that in some similar cases, the obtained correlation length is long and considers values of adjacent aquifers of different types, which negatively affects the estimation procedure. To avoid similar problems, this study applies the SOM method to cluster the data based on local similarity, and we consider this an innovative addition to the geostatistics methodology.

Results and discussion
The original groundwater level dataset is significantly skewed (Table 1). Therefore, the Box-Cox transformation method, with calculated power exponent k ¼ À0:32, was employed to transform the data closer to the Gaussian distribution. The data Gaussianity was significantly improved (Fig. 3), and this is also validated by the improvement in the metrics:ŝ ! = 0.25,k Y = 3.00, very close to the desirable values: s = 0, k = 3.
The global experimental variogram of the entire observations' dataset ( Fig. 3) exhibits significant fluctuations and does not appear to level off (reach a sill). On the other hand, the one of the transformed data set (Fig. 4) is more stable, and exhibits less fluctuations without reaching a clear sill as well. The best fit theoretical variogram model (Fig. 4) was determined after testing on the experimental variogram different lag distances and maximum variogram spatial distance. The main reason for these variations is the considerable inhomogeneities of the aquifer systems, which significantly affect the piezometric values of the groundwater and lead to significant variations in the level range.
A 19-by-5 nodes grid was applied in this work (Fig. 5) due to the dimensions of the island of Crete and 10,000 repetitions to determine the most representative SOM. When the data measurements associated with each node (Fig. 6) are sufficient to develop at least 30 pairs for each experimental variogram lag distance (Deutsch and Journel 1992), a new variogram is calculated. Then the calculated local parameters (after theoretical variogram fitting) are applied in the TGK prediction procedure for the estimation locations under the influence of the node, if not the calculated global variogram type (Fig. 4) using all the available measurements and its global calculated parameters are applied. The quality of the SOM is measured by two criteria: Here, the final quantization error is 0.393, and the final topographic error is 0.013; both are very close to zero indicating optimal training.
Then, a leave-one-out cross validation analysis is performed to test the prediction efficiency of the proposed method ( Table 2). The superiority of the TGK-SOM method over the classical geostatistical approaches is clear with respect to all statistical metrics, which can be explained by the more efficient selection of the estimation neighborhood with the SOM technique. The SOM clustering algorithm is an effective approach for detecting and analyzing the inherent structural modes of spatial data. It enables the interpretation of the spatial variation phenomena occurring in the study area by making the classification results for calculating the correlation neighborhood more meaningful. Figure 7 shows the spatial distribution of the bias corrected groundwater level throughout the island of Crete using the TGK-SOM method, while Fig. 8 -presents the associated uncertainty in the estimates. On the presented  groundwater level spatial distribution map (Fig. 7), it can be seen that most of Crete's coastal aquifers have levels near or below sea level, indicating that salt water has intruded. Furthermore, areas with significant groundwater resources availability and those with significant shortages, primarily due to overexploitation, can be identified. Previous studies of the island's groundwater resources assessed the major aquifers separately (Special water secretariat of Greece 2017). Furthermore, the high density of data in the island's East and Northwest allows for reducing estimation uncertainty (Fig. 8) of the groundwater level distribution in inland and mountain aquifers, which can be extremely useful for irrigation management. On the other hand, the Southwest part has fewer monitoring stations and estimates fraught with uncertainty (Fig. 8).
To further substantiate the efficiency of the proposed method, a dataset of biannual groundwater averages for the same area and year but from a different source (Decentralized Administration of Crete 2020) was used as a prediction-validation set, keeping the properties and settings of the previously applied TGK-SOM method. The new dataset consisted of 311 values spatially distributed throughout the island. The validation results were consistent with those of the leave one out cross-validation analysis; Mean absolute error: 3.31 m, Mean absolute relative error: 11.56%, and Correlation coefficient R: 0.90. Thus, the proposed methodology can be said to be an efficient geostatistical tool for estimating the spatial distribution of groundwater levels in complex aquifer systems. The spatial distribution of the mean absolute error is presented in Fig. 9. Although lower the errors compared to the other methodologies, the highest errors of TGK-SOM are located mainly towards the inland of the island, in the mountainous areas, where primarily karst aquifers exist that usually deliver inconsistent measurements due to their hydrogeological nature and properties.
The cross validation results are directly comparable with other modern groundwater-level geostatistical approaches; involving Bayesian analysis (Fasbender et al. 2008;, covariates (Ruybal et al. 2019), or modelling approaches of complex hydrogeological systems (Dokou et al. 2016;Khadim et al. 2020;Tigabu et al. 2020). The cross-validation error in these works ranges from 10 to 18%. This fact validates the efficiency of the proposed methodology and its capability of exploiting efficiently information from large datasets for competent spatial analysis.

Conclusions
The island of Crete has a complicated network of adjacent porous and karst aquifers due to diverse boundary conditions and hydrogeological features, which prohibits a thorough physical modeling of the island's aquifers as a whole. In this study, we showed that combining geostatistics with self-organizing maps to estimate the spatial distribution of groundwater level in a large-scale application of complicated hydrogeology offers a practical and reliable new approach for this kind of applications. The groundwater level data from the entire island is used in this study, and SOM is used to take into account locally similar observations for geostatistical analysis in terms of TGK. Physically based models need a lot of data, and the proposed method offers a complementary tool.
The presented approach is to be tested on more cases, to allow for wider generalizations. Future work will be also oriented towards testing other unsupervised learning approaches of machine learning that may provide additional benefits, if compared to using self-organised feature maps, and further development of other approaches hybridizing the best features of machine learning and geostatistics.
Acknowledgements The authors would like to thank the Special water secretariat of Greece for providing the data online. The national water monitoring program is presented in http://nmwn.ypeka.gr/?q= en. The InTheMED project, which is part of the PRIMA Programme supported by the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No 1923. Funding Open access funding provided by HEAL-Link Greece.
Data availability Datasets for this research are available in (Special water secretariat of Greece 2020) and can be accessed free in http:// wfdgis.ypeka.gr/?lang=EN. In addition, the authors would like to

Conflict of interest The authors declare no conflicts of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.