Introduction

The output of clustering algorithms is conventionally visualized with scatter plots, which cannot convey bordered areas, zones, or regions (Fig. 1). Our approach was motivated by this lack of methods for creating boundary-based regions. Liu et al. (2015) partially addressed the problem by developing a clustering method based on the Delaunay triangulation network, in which data samples are clustered according to their spatial proximity to one another. The spatial boundary derived from spatial density is part of their algorithm workflow and is used to distinguish between the different clusters. However, their mapping approach is tied to their clustering method and cannot be applied to other clustering algorithms, and their clustering method operates only on latitude and longitude, making it unsuitable for clustering any other parameters or features.

Fig. 1

An example of a conventional way to illustrate clustering results. The data points represent groundwater wells reported by the California Natural Resources Agency (California Natural Resources Agency 2021) projected on a topographic map (ESRI 2013). Note that the zonation of cluster points (different colors represent different subbasins) is produced without any borders or closed areas

Many fields that use clustering algorithms adopt this same approach to represent the data, and in many cases, it is sufficient to show scatter plots with class-based colors (Fig. 1). However, there are instances in which displaying regions rather than points would be more informative and clearer. This is particularly useful in fields such as meteorology (Ohba et al. 2016; Singh et al. 2017), climatology (Köplin et al. 2013), oceanography (Sun et al. 2021; Wichmann et al. 2020), crime analysis (Lombardo and Falcone 2011; Mburu and Mutua 2023), and others (Li et al. 2020; Subba Rao and Chaudhary 2019). Displaying clusters as regions can help to identify important boundaries that can affect the output or reduce the risk of misclassification. GeoZ, an acronym for geographic decision zones, is an implementation of this concept; as its name indicates, its basic purpose is to project the decision zones of trained ML models onto the geographic coordinate system (GCS).

To demonstrate the methodology and application of boundary-based regions, we selected a groundwater (GW) well dataset that contains the longitude and latitude of the GW wells along with the subbasins to which they were allocated based on geomorphological and administrative boundaries (California Department of Water Resources (DWR) 2021). We tested conventional cluster mapping using scatter plots (with the manually introduced boundaries, Fig. 2) and basic statistical boundaries using the Voronoi tessellation method to form a picture of the data distribution and the expected boundaries (Fig. 3). We created the Voronoi tessellation by forming a centroid inside each cluster; the tessellation divides the study area into regions whose boundaries are equidistant from neighboring centroids. Although this approach provided an initial idea of the study area’s clustering, it had limitations owing to the irregular distribution of the data. Consequently, we adopted advanced machine learning (ML) models, such as support vector machines (SVM), to delineate stochastic boundaries around the data.
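For illustration only, the following minimal sketch shows how such a centroid-based Voronoi tessellation could be computed with SciPy; the file name and column names are assumptions and do not reflect the actual DWR schema.

```python
# Illustrative sketch (assumed file and column names): compute one centroid
# per subbasin cluster and derive a Voronoi tessellation from the centroids.
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

wells = pd.read_csv("ca_gw_wells.csv")                       # hypothetical file name
centroids = (wells.groupby("SUBBASIN_NAME")[["LONGITUDE", "LATITUDE"]]
                  .mean()
                  .to_numpy())

vor = Voronoi(centroids)                                     # boundaries equidistant
fig, ax = plt.subplots(figsize=(6, 8))                       # from neighboring centroids
voronoi_plot_2d(vor, ax=ax, show_vertices=False, line_colors="red")
ax.scatter(wells["LONGITUDE"], wells["LATITUDE"], s=1, alpha=0.3)
plt.show()
```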

Fig. 2

Manually delineated boundaries of the groundwater subbasins of California, to which the clustered wells in Fig. 1 were assigned (different colors represent different subbasins) (California Natural Resources Agency 2021; ESRI 2013)

Fig. 3

An example of Voronoi tessellation delineating subbasins (red lines) using calculated cluster centroids (blue dots) based on the same data used in Fig. 1. Note that the created boundaries are largely controlled by the data distribution and are independent of the natural subbasin boundaries

Despite the large amount of data usually available in clustering problems, data point density can be low over large parts of the study area, creating regions of uncertainty that are often discarded from consideration. To overcome this, we utilized ML algorithms to train a model on the available data and make predictions in the areas of sparse data. We fed the model sequential points filling a grid that covers the entire study area and drew its decision boundaries onto the GCS to see how the boundaries would translate to the real world. Although the proposed approach has the disadvantage of depending on the availability of data for its accuracy, it provides a new tool for mapping clustering datasets without well-defined boundaries, which can be rapidly and accurately represented using ML models.

Study Area and Data

Study Area

The State of California was selected for the case study to demonstrate the capabilities of our novel algorithm. The state ranks third in the USA in terms of size, with an area of almost half a million square kilometers (Prothero 2017). This makes California home to many unique natural systems, including several major water bodies and diverse natural terrains. The vast area of the state allows for various hydrological conditions and exhibits a distinctive and complex hydrogeological canvas (Carle 2015; Prothero 2017). Unlike GW basins, the GW subbasins in California are divided according to a combination of natural boundaries and approved administrative boundaries (California Department of Water Resources (DWR) 2021). This approach creates a complex and unique map that serves as an excellent illustration of the need for region-based representation (Fig. 2) and as a potential application of the modeling capabilities of the GeoZ library.

Datasets

The dataset employed in this study was provided by the California DWR and is publicly available on the DWR website (California Natural Resources Agency 2021). It contains 17 columns and 45,923 rows comprising the geographic locations of the wells drilled in the State of California and many other characteristics, such as the well type, uses, and depth, as well as the program monitoring each well, because wells fall under the jurisdiction of different agencies. For the purpose of our study, we removed all columns except three: the geographic coordinates of the GW wells (latitude and longitude) and the subbasin classification of each well.

The dataset was cleaned to remove duplicates. In addition, any well that did not have a subbasin classification was removed. This filtration ensured that no null values were encountered by the model during training, thus avoiding any interruption during code execution. After cleaning, the data of 42,868 wells distributed across 287 unique subbasins remained. The subbasin names were then encoded as numbers instead of their original string names. This was done to simulate the output of clustering algorithms, given that GeoZ was specifically designed to address this use case. Another reason for encoding the names was to allow the bazel_cluster function to work without issues (details of the bazel_cluster function are discussed later in this paper).
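As an illustration of these preparation steps, a minimal pandas sketch is shown below; the file name and column names are assumptions and do not reflect the exact DWR schema.

```python
# Illustrative cleaning and encoding steps (assumed file and column names):
# keep coordinates and subbasin labels, drop duplicates and unlabeled wells,
# then encode the subbasin names as integers to mimic clustering output.
import pandas as pd

df = pd.read_csv("gw_wells_dwr.csv")                         # hypothetical file name
df = df[["LATITUDE", "LONGITUDE", "SUBBASIN_NAME"]]          # keep three columns
df = df.drop_duplicates()
df = df.dropna(subset=["SUBBASIN_NAME"])                     # remove unclassified wells

df["SUBBASIN_ID"] = df["SUBBASIN_NAME"].astype("category").cat.codes

X = df[["LATITUDE", "LONGITUDE"]].to_numpy()                 # features (no scaling)
y = df["SUBBASIN_ID"].to_numpy()                             # integer labels
```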

Methodology

Theoretical Concept

To construct a practical library, we identified our requirements and the limitations associated with such demands. The premise of the library began with the simple goal of delineating the boundaries around the available data points. Similar to any data-driven model, the accuracy of the model is directly related to the amount of data fed into it. The model input comprised two features and one label column. The two features were latitude and longitude in their actual values, without scaling or normalization. The labeled data were the groups or classes containing the data points. Using these settings allowed us to train the model to identify the regions surrounding each point cluster while exhibiting a limited amount of uncertainty regarding the extent of each cluster’s boundary, a feature that is intended to be controlled by the prospective model’s hyperparameters.

Training the ML model with pure geographical data allowed us to restrict the feature space of the model to the GCS, so that the two coincide. Thus, we can directly project the model predictions onto any geographical map. However, introducing geographical data without scaling or normalization could reduce the model accuracy under circumstances wherein the variance between the points becomes significant (Ozsahin et al. 2022). Because such situations are rare in the spatial domain, and provided this risk is kept in mind when using the model, these effects can be alleviated. As the model feature space and the GCS overlap, the model output can be visualized on the surface representation of the Earth.

The next step is to find a model capable of predicting the data point classes with high precision based only on the provided latitude and longitude. Training the model on a reasonable amount of data allows it to create decision boundaries or zones (DZ) around each class. Thus, when a point’s latitude and longitude fall within a given zone, the model classifies the point into the class that covers that geographic area. Restricting the model input to latitude and longitude projects these DZ onto the real world. Thus, when visualized using a decision boundary plotting tool, each cluster’s DZ is drawn as the spatial boundary surrounding the cluster on the geographic map. Three decision boundary plotting tools with different properties are utilized in GeoZ and are discussed in detail in the “Implementation: GeoZ Library” section. A workflow of the process is illustrated in Fig. 4.
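The following minimal sketch illustrates this idea, not the GeoZ implementation itself: a classifier is trained on raw latitude/longitude and then evaluated on a dense grid spanning the study area so that its decision zones can be drawn directly in the GCS. It assumes the X and y arrays from the data-preparation sketch above and uses the SVM settings reported later in this paper.

```python
# Sketch of the core workflow: train on raw coordinates, predict over a grid,
# and draw the resulting decision zones in geographic coordinates.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

model = SVC(kernel="rbf", C=100, gamma=30).fit(X, y)         # settings used in this study

lon_g, lat_g = np.meshgrid(
    np.linspace(X[:, 1].min(), X[:, 1].max(), 300),          # longitude axis
    np.linspace(X[:, 0].min(), X[:, 0].max(), 300),          # latitude axis
)
grid = np.c_[lat_g.ravel(), lon_g.ravel()]                   # (lat, lon) pairs, as in training

zones = model.predict(grid).reshape(lat_g.shape)             # decision zones (DZ)
plt.contourf(lon_g, lat_g, zones, alpha=0.4)                 # zones projected onto the GCS
plt.scatter(X[:, 1], X[:, 0], c=y, s=1)                      # wells on top of the zones
plt.show()
```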

Fig. 4

Schematic diagram showing the steps followed in the development of a geographic decision zone map (GeoZ), starting with the clustered dataset used as labeled input data to train a supervised ML model and ending with the production of the GeoZ map after processing and mapping by the decision boundary plotting tool

We demonstrate the theoretical concept in Fig. 5, which shows three layers. The top layer corresponds to the GW subbasins in California, which are color coded to facilitate differentiation. This layer represents our ground truth and the reference for accuracy verification. The second layer contains all the available data points (in blue) assigned to the GW subbasin locations, which represent the sample points used to train the ML model. The third layer shows the model prediction based on the data provided in the second layer. In this instance, the prediction is from K-means, one of the many ML models tested in this study, as described in the next section. The subbasin boundaries are shown in red, and the correspondence of the model input features to the latitude and longitude in the GCS is also indicated.

Fig. 5

3D conceptualization of the proposed theoretical concept to predict the GW subbasin boundaries. The first layer represents the manually delineated GW subbasins (Fig. 2), the second layer represents the GW wells (data points) shown in Fig. 1 used in the model, and the third layer shows the model prediction (K-means predictions). The model prediction was based on the longitude and latitude of the data from the second layer (created using Python, ArcGIS Pro, Inkscape, and Paint 3D)

Because most of the models used in this study originated from the scikit-learn library (Pedregosa et al. 2011), we used their score function implementation to measure the prediction (or drawing) accuracy of each classifier. The score function is part of the general scikit-learn API; therefore, it is inherited by most classifiers and, if required, modified to accommodate their nature. The function counts the correctly predicted values, divides this count by the total number of predictions, and returns the fraction as a score.

In ML, datasets are usually divided into training and testing sets to evaluate the performance of the trained model. In our case, however, we fed the entire dataset into the model without setting aside a test set. The reasoning behind our approach follows from the purpose of the trained model. In our use case, the main goal was to delineate a boundary around the given points, similar to a convex hull (Barber et al. 1996), and not to predict the values of future points. Therefore, although clustering algorithms could be used to achieve comparable results, testing their accuracy would be difficult. Moreover, akin to clustering algorithms, the model will not be used to make any future predictions or extrapolations beyond the training set, as it only needs to follow the data and draw the boundaries.

Based on all the aforementioned conditions, we can even argue that overfitting the model to a certain degree is acceptable. Nevertheless, including a regularization hyperparameter in the model is advantageous, as it allows us to control the model fitness without sacrificing any data to measure or increase its generalizability. Therefore, to measure model performance, we fed the same dataset used for training into the score function; any misclassified point decreases the accuracy score and distorts the shape of the drawn decision boundary.
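As a minimal sketch of this evaluation approach (assuming the X and y arrays from the earlier sketches), the model is trained and then scored on the same data:

```python
# Train on all wells (no test set withheld) and score on the training data;
# for classifiers, score() returns the mean accuracy on the supplied data.
from sklearn.svm import SVC

model = SVC(kernel="rbf", C=100, gamma=30)
model.fit(X, y)

accuracy = model.score(X, y)            # equivalent to (model.predict(X) == y).mean()
print(f"Delineation accuracy: {accuracy:.1%}")
```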

ML Models

Having established the basic model requirements, we could iteratively extend the requirement list as we compared the model outputs with the ground truth to address any flaws or shortcomings in performance. Because we have two features and a label, the problem can be considered a simple classification problem. Based on the “No Free Lunch” theorem (Wolpert and Macready 1997), we decided to try most of the ML classification models available in the scikit-learn library, as well as some of its clustering algorithms.

As discussed in the “Introduction” section, the purpose of this library was to delineate GW subbasins based on the available well data. Given the complexity of the GW subsurface structures, we knew beforehand that linear models would fail to follow their shape and would make predictions with low accuracy. However, we still included linear models in our experiment to observe how they would behave when confronted with the complex shapes of the subsurface, as this could provide insights into properly tuning more advanced models while minimizing costs (time and resources). Another reason for including the linear models was their code simplicity: the scikit-learn general API enables testing various classification models with just a few extra lines of code, so exploring the problem from different angles was worth the time required to write this code.

The models tested in this research included three clustering algorithms: (1) K-means, (2) Gaussian mixture model (GMM), and (3) Bayesian GMM. Ten classification algorithms were used: (4) linear regression, (5) Bayesian ridge regression, (6) logistic regression, (7) artificial neural network (ANN), (8) k-nearest neighbors, (9) linear discriminant analysis, (10) histogram-based gradient boosting, (11) AdaBoost, (12) Gaussian Naive Bayes (NB), and (13) SVM. To optimize the models’ performance, we primarily worked with the default hyperparameters and devoted only a few hours to manual tuning when suboptimal performance was observed on the given dataset. We set a maximum time limit of one working day for parameter optimization to identify which tools could function effectively with the data without requiring extensive hyperparameter tuning. This approach allowed us to implement the classifier with default settings within the mapping function, minimizing user interference or modification. The time limit also provided an opportunity to assess the ease of tuning the model hyperparameters and the level of knowledge and experience required to achieve satisfactory results.
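For illustration only, the uniform scikit-learn estimator API allows such a comparison with only a few extra lines of code; the sketch below covers a subset of the listed classifiers with default settings (X and y as defined in the earlier sketches).

```python
# Illustrative comparison loop over several candidate classifiers with
# (mostly) default hyperparameters; only a subset of the 13 models is shown.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

candidates = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "Linear discriminant analysis": LinearDiscriminantAnalysis(),
    "Histogram-based gradient boosting": HistGradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Gaussian NB": GaussianNB(),
    "SVM (RBF kernel)": SVC(),
}

for name, clf in candidates.items():
    clf.fit(X, y)
    print(f"{name}: {clf.score(X, y):.3f}")   # accuracy on the training data
```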

Requirements

Most of the models we tested, except for the pure linear models, produced acceptable prediction accuracies; however, the resulting geographic maps did not accurately represent reality. This was mainly because of interpolation, extrapolation, or delineation errors stemming from each model’s functionality and working mechanism. As a result, we introduced more requirements into our list to address the models’ shortcomings, eventually reaching the following six-clause list:

  1. The model must avoid extrapolation as much as possible.

  2. If extrapolation is required, it must be performed such that it does not affect the delineation of the subsurface boundaries or distort them.

  3. The model must produce arbitrary shapes instead of only geometric ones; thus, it must achieve a complexity level that would allow it to delineate the subbasin boundaries as accurately as possible (this requirement disqualifies most linear models owing to their linear nature).

  4. The model must be as simple as possible to save computational resources and time.

  5. It should be stable and robust against outliers with low to moderate sensitivity.

  6. It should be flexible (hyperparameter-wise) to allow for some control over any useful elements (including the uncertainty) in the produced maps.

This list is not exhaustive and can be amended in the future; hence, it should be considered a guideline rather than a set of restrictive rules. The list was formulated based on our experiments with GW data, and modeling data from other fields may require fewer or even more rules. Assessing item five was not feasible in the present research; therefore, its inclusion is based on the acknowledged characteristics of each method (e.g., Gaussian NB is known to be vulnerable to outliers, whereas tree-based methods are known for their robustness against them).

The experiments documented in this study were performed on a computation node in the high-performance computing (HPC) system at the UAE University. The node comprised a 36-core processor, 377 GB of RAM, and a Linux operating system. These were the initial conditions used for the experiments; however, after the creation of the GeoZ library and its publication in the Python Package Index (PyPI), it was used and tested on different systems, including a Windows system with a Core-i9 processor and 64 GB of RAM and a Linux system with a Core-i7 processor and 24 GB of RAM. We expect that the initially high hardware requirements are not necessary to run the library and that ordinary computer systems with an adequate amount of RAM can run it without issues; however, RAM requirements increase in proportion to the size of the dataset to be drawn.

Results and Discussion

In addition to the visual inspection of the produced maps to identify their flaws and inaccuracies, we also recorded their accuracy scores, which can reveal minor errors that might be difficult to detect visually. Under the aforementioned tuning restrictions, most models achieved relatively high accuracy scores. The ANN implementation in scikit-learn was an exception: it performed poorly despite being one of the most complex models available (Ismailov 2023). Our efforts to tune its hyperparameters within the self-imposed time limits did not yield any noteworthy increase in accuracy. This made it reasonable to exclude ANN algorithms from the prospective classifiers, as most of them require specific settings and hyperparameter tuning to adapt to the provided dataset (Gupta et al. 2021). Unlike typical scenarios, wherein the trained model is an intermediate step and its inference ability is the final product, in our use case the trained model is the final product. Therefore, it is imperative to find a model that requires the least amount of user interference for training.

Table 1 shows the results of running each model on the dataset and the number of rules from the requirement list it was able to satisfy. We also noted the time required for each run, which included the times required to train the model and draw the final map. The only model that achieved a high accuracy score, passed the visual inspection, and satisfied all the requirements was the SVM, albeit with some limitations, which will be discussed later in detail along with our attempts to address them. However, because it met the minimum requirements, we were able to use it as the base model for the GeoZ library drawing mechanism; therefore, it can be considered for any kind of similar research or even in a production environment while bearing in mind the limitations imposed by its nature.

Table 1 Accuracy, time required, and the requirements satisfied by the different methods tested in this study

The results of the modeling experiments are shown in Fig. 6. Most linear models achieved low accuracies and produced linear structures in the map, which are far from reality. The accuracies of the clustering algorithms could not be measured using the score function; however, they showed a clear separation between the clusters and an excellent ability to follow the shape of the GW subbasins. Their disadvantage is that they use something akin to linear extrapolation for any location outside the data conglomeration, which is particularly evident in the maps generated using the K-means and k-nearest neighbors algorithms. The tree-based classifiers achieved very high accuracy; however, their dependence on geometric shapes, owing to their tree-based decision-making nature, created unnatural maps that did not reflect the shapes of the GW subbasins, even when the classes were accurately predicted.

Fig. 6

Illustrations of the performances of the ML models tested in this study (except SVM). The models attempted to delineate the subbasin regions based on the provided data. Different regions are indicated using different shades of green. The red lines indicate the ground truths of the subbasins, and the well locations are indicated in white. The model numbers (in white squares) correspond to their numbers in Table 1

The GMM and Bayesian GMM are clustering algorithms that assume normally distributed samples; because labeled data cannot be supplied to these cluster-based algorithms, several wells were inaccurately clustered. Nevertheless, the GMM showed a good ability to follow the shape of the GW subbasins. Finally, the SVM (depicted individually in Fig. 7a) and the Gaussian NB achieved high accuracies, showed the best ability to follow the shapes of the GW subbasins, and restricted generalization to a small area around the data instead of extrapolating to infinity.

Fig. 7

SVM classifier maps illustrating the effects of using the Bazel Cluster function in the GeoZ library. a Map drawn using the MLxtend module without using the Bazel Cluster function. b Map drawn using the MLxtend module and Bazel Cluster function. The red circles indicate the “Generalized Cluster” selected by the SVM classifier in a and its transfer to the Bazel Cluster in b, thus maintaining all the cluster representations in the map

SVM

Among the key differences between the two best methods in this study, SVM and Gaussian NB, is that SVM achieved higher accuracy and offered better mapping flexibility owing to its hyperparameter options, which allowed us to control how far the model extrapolates the boundary around the data as well as how interconnected the distant points of each cluster are. Moreover, the SVM algorithm was distinguished from all the tested algorithms by the manner in which it created the decision boundary for each cluster. SVM adopts an approach similar to the convex hull, but follows the data distribution more flexibly and closely. It creates clusters as closed boundaries surrounding the data while leaving the remainder of the map empty. In contrast, most models attempt to generalize each cluster in some manner, according to their mathematical nature, so that no part of the feature space is left empty. This means that if the model is provided with features, it will predict a label, even if that label is wrong or has no logical connection to the provided data, as is evident from the Gaussian NB generalization of the edges.

In contrast, SVM creates a simple boundary around each cluster and then generalizes all the empty space and background as one of the newly created clusters. This allows for accurate representations of all clusters except one, which we call the “Generalized Cluster” for the sake of practicality. Generally, the densest or most dispersed cluster is selected as the Generalized Cluster. The SVM classifier generalizes this cluster to the entire feature space outside the other clusters. When the decision boundary is drawn, the Generalized Cluster appears as the background, whereas the remaining clusters are surrounded by it, appearing as if they are floating on top of it. This can be clearly observed in Fig. 7a. Although this behavior is one of the weaknesses of the SVM, it is also one of its strengths: by restricting the generalization issue to a single cluster, we can devise solutions for that one cluster instead of addressing the generalization of each cluster individually, as is the case with other classification methods. We attempted to address this issue by creating a function called “Bazel Cluster.”

The SVM algorithm has two hyperparameters that can be modified to increase the model accuracy and control its behavior. The first is the C parameter, which controls the degree of influence each point has on the decision boundary of the classification as well as the importance of each point. Increasing C increases the importance of each point and consequently its influence on the location of the decision boundary, whereas decreasing C has the opposite effect. The second is the gamma hyperparameter, which acts as a regularization parameter that prevents the model from overfitting. These two parameters have different effects on map creation.

Based on our experiments, the gamma hyperparameter controls the uncertainty regions of the boundary. As a regularization parameter, it is inversely related to the size of the buffer zone around the points of a cluster as well as to the interconnectedness of distant points within the same cluster/class. Decreasing its value increases the buffer zone around the points, thereby increasing the uncertainty of the cluster boundaries relative to reality, whereas increasing its value decreases the buffer zone and consequently the boundary uncertainty. The C parameter controls the effect of outliers on the boundary location; increasing its value forces the model to consider every point, so the boundary must follow the point locations, whereas decreasing its value smooths the boundary and allows the model to ignore some points as outliers or misclassification errors. To obtain an accurate representation of the data, it is imperative to substantially increase the C parameter and to adjust the gamma hyperparameter to a degree that depends on the dataset.
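As an illustration of these settings (using the values adopted in this study and the X and y arrays from the earlier sketches), both hyperparameters are simply passed to the scikit-learn SVC constructor:

```python
# A large C forces the boundary to honor every well; gamma tightens (high
# values) or loosens (low values) the buffer zone around each cluster.
from sklearn.svm import SVC

loose = SVC(kernel="rbf", C=100, gamma=30)     # wider buffer, more boundary uncertainty
tight = SVC(kernel="rbf", C=100, gamma=1000)   # narrower buffer, boundary hugs the data

for model in (loose, tight):
    model.fit(X, y)
    print(f"gamma={model.gamma}: training accuracy {model.score(X, y):.3f}")
```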

Bazel Cluster

The Bazel function is a Python function named after the English word “bezel” to indicate its purpose: acting as a bezel surrounding the cluster map. The “e” in “bezel” was replaced with an “a” to make the function calls easier to identify inside the Jupyter environment, which helped considerably with bug fixing during development. The function was created solely to address the weakness of the SVM algorithm; hence, it would be unwise to utilize it elsewhere. The Bazel Cluster function receives the dataset intended for visualization and, based on the data coordinates, determines the edges of the data distribution and creates a bezel or frame that envelops the data. The frame is formed by generating new data points. By default, their number equals the number of points in the largest cluster in the dataset plus one extra point; however, the user can adjust the number of samples in the frame. The frame is located 1 standard deviation (SD) away from the map edges; its width is also 1 SD but can be adjusted according to the user input.

This function works by adding an extra cluster of points that is dispersed around the map and has more points than the largest cluster in the dataset, thereby forcing the model to select it as the Generalized Cluster. Once the model treats it as the Generalized Cluster, the produced map contains all the actual clusters, whereas the Bazel Cluster disappears into the map background (Fig. 7b). Unfortunately, this method is not always effective, because the model sometimes selects one of the actual clusters as the background. Such failures are easy to detect, however, because the Bazel Cluster then appears as a clear frame surrounding the map; the user can adjust the Bazel Cluster parameters to increase the frame width or the number of samples and thus force the SVM classifier to treat the Bazel Cluster as the Generalized Cluster. This process can be iterated until the expected result is achieved. The Bazel Cluster function is disabled by default in GeoZ but can be enabled through the “bazel” parameter.
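The actual implementation resides in the GeoZ source code; the following is only a conceptual reconstruction of the idea based on the description above (the function name, sampling strategy, and defaults shown here are hypothetical).

```python
# Hypothetical sketch of a bazel-style frame: synthetic points are scattered
# in a band roughly 1 SD beyond the data extent and assigned a new label with
# one more point than the largest real cluster. Not the GeoZ source code.
import numpy as np

def make_bazel_cluster(X, y, width_sd=1.0, n_points=None, seed=None):
    rng = np.random.default_rng(seed)
    if n_points is None:
        n_points = np.bincount(y).max() + 1                  # largest cluster + 1
    sd = X.std(axis=0)
    lo = X.min(axis=0) - width_sd * sd                       # enlarged bounding box
    hi = X.max(axis=0) + width_sd * sd

    frame = []
    while len(frame) < n_points:                             # keep only points in the
        p = rng.uniform(lo, hi)                              # band outside the data extent
        if not (np.all(p > X.min(axis=0)) and np.all(p < X.max(axis=0))):
            frame.append(p)

    new_label = y.max() + 1                                  # cluster the SVM should generalize
    X_out = np.vstack([X, np.array(frame)])
    y_out = np.concatenate([y, np.full(n_points, new_label)])
    return X_out, y_out
```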

Limitations

Despite addressing the generalization issue of the SVM classifier, this implementation has several limitations. Some of the major ones are as follows:

  • Hyperparameter tuning: in our experiments, the most accurate and visually faithful results were obtained by setting the kernel to the radial basis function (RBF) and the gamma hyperparameter to 30; however, depending on the application or even the study area, the gamma value might have to be changed.

  • To overlap the feature space and GCS, we must maintain the original latitude and longitude values; however, a significant difference in the latitude and longitude between points in the dataset would considerably weaken the model’s ability to correctly predict and draw the boundaries. Therefore, the map area must be considered while viewing the dataset.

  • Another aspect of map size is related to the nature of the GCS. Many ML algorithms require continuous data; however, the GCS is discontinuous, and this discontinuity is most likely not accounted for by the ML models. Thus, the model may produce incorrect results, or even non-existent coordinates, when inferring cluster boundaries. An SVM with the RBF kernel can model nonlinear relations and may therefore handle discontinuous spaces; however, its performance highly depends on the quality and quantity of the provided dataset (De Marchi et al. 2020). Hence, it is preferable that the user works with map areas that do not cross the discontinuities of the GCS boundaries (180, − 180) and (90, − 90).

  • The computation requirements (primarily the RAM) for running the library are directly related to the size of the input dataset, as the SVM classifier loads the entire dataset into RAM before processing it; thus, if the dataset is larger than the available RAM, the algorithm will fail. Therefore, the specifications of the device running the library should be considered if the dataset is very large (> 100,000 records).

  • The Bazel Cluster function sometimes requires several adjustments to successfully force the classifier to treat it as the background. However, the method is not guaranteed to work; hence, its fundamental concept and operational mechanism remain good prospects for enhancement.

  • The SVM classifier is robust against outliers; however, in our case, this ability can be considered a risk to the model delineation process. This is mainly because of our certainty that all data used in training are actual data and not outliers; hence, the removal of any data is detrimental to model accuracy. We have yet to find a solution to this limitation other than trying to manipulate the SVM hyperparameters.

  • Finally, aside from the kernel, the SVM classifier has two hyperparameters (C and gamma), both of which affect the model results and, by extension, the final map shape. Therefore, deciding the appropriate values for the SVM hyperparameters can be an issue, especially because they can differ depending on the field. In our experiments, the C hyperparameter did not have as significant an effect as gamma on the model classification or map shape; however, because we know that all points matter and are accurate, we elected to assign a value of 100 to the C parameter to force the model to consider all the points, which produced optimum results for our dataset. Regarding gamma, we found that the optimum value for representing GW subbasins is 30; however, depending on the user’s preference in dealing with uncertainty, it can be increased up to 1000.

Implementation: GeoZ Library

GeoZ is a Python library that was developed to implement the proposed theoretical approach. The library integrates several ML algorithms to create geographic maps from the output of unsupervised ML techniques, primarily clustering algorithms. It is written entirely in Python, is open source under the BSD 3-Clause license, and has been published in PyPI. GeoZ contains three modules that perform the similar task of creating geographic maps from the output of clustering algorithms, but with different approaches and using different libraries. It should be noted that Matplotlib is the backend drawing library used by GeoZ and by most Python plotting libraries (Hunter 2007). Brief descriptions of the modules included in GeoZ and their purposes are provided in the following sections. The parameters of each module are detailed in the corresponding function’s documentation; therefore, to avoid redundancy, we do not elaborate on them in this study.

Convex Hull Module

This module creates a convex hull for each set of points belonging to a distinct cluster using Shapely’s “convex_hull” operation (Gillies et al. 2022), which is iterated over the clusters to eventually draw a map containing all the clustered data. The main advantage of this method is that it can reveal all evident overlaps in the clustering output; the other methods cannot portray overlapping regions owing to the underlying modeling algorithm (SVM). However, owing to its restricted geometric drawing ability, this method cannot accurately delineate the cluster regions, nor should it be used for that purpose. Because it does not involve any ML algorithms, it executes quickly and is suitable, to a certain degree, for initial testing and prototyping of the clustering algorithm’s parameters.
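Conceptually, the per-cluster hull construction can be sketched with Shapely as follows; this is an illustrative simplification, not the GeoZ module itself (X and y as in the earlier sketches).

```python
# One convex hull polygon per cluster label, built from that cluster's points.
import numpy as np
from shapely.geometry import MultiPoint

hulls = {}
for label in np.unique(y):
    cluster_pts = X[y == label][:, ::-1]                 # reorder to (lon, lat)
    hulls[label] = MultiPoint(cluster_pts.tolist()).convex_hull

# For polygon hulls, hulls[label].exterior.coords gives the boundary to draw,
# and hulls[a].intersects(hulls[b]) reveals overlapping clusters.
```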

Decision Boundary Display Module

This module utilizes scikit-learn’s “DecisionBoundaryDisplay” class (Pedregosa et al. 2011) to derive a geographic map. It also utilizes the GeoPandas library to draw points on a map (Jordahl et al. 2022). This method is advantageous in that it gives users considerable flexibility to modify and adjust the map according to their preferences, as opposed to the other methods included in the GeoZ library. It is well suited for prototyping and quick drafts, as users can reduce the resolution and thus produce more maps and variations in a short time.
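A minimal sketch of rendering the zones of a fitted model with this class is shown below; the GeoPandas overlay used by the actual module is omitted, and the fitted model and arrays are those assumed in the earlier sketches.

```python
# Plot the decision zones of the fitted classifier; grid_resolution trades
# detail for speed, which suits quick prototyping and drafts.
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

fig, ax = plt.subplots(figsize=(6, 8))
DecisionBoundaryDisplay.from_estimator(
    model,                        # fitted SVC from the earlier sketches
    X,                            # (latitude, longitude) training features
    response_method="predict",
    grid_resolution=200,          # lower values -> faster, coarser maps
    alpha=0.5,
    ax=ax,
)
ax.scatter(X[:, 0], X[:, 1], c=y, s=1)   # same axis order as the display (feature 0 on x)
plt.show()
```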

Decision Region Module

This module uses MLxtend’s “decision_regions” function (Raschka 2018) to draw the map. The advantage of this method is that it produces a detailed, high-resolution map in which the decision regions are drawn in different colors and the data points of different clusters are marked with different symbols. This is a considerable advantage over the default color schemes used in scikit-learn: as the number of clusters grows, the algorithm is forced to cycle through the same set of colors, which can confuse end users, so adding symbols to differentiate between clusters, in addition to the region colors, is a significant benefit. However, the high resolution of the output limits its usage, because drawing the maps takes a significant amount of time, which is a disadvantage during prototyping; this plotting method should therefore be used for creating the final exported map. The end result is demonstrated in a side-by-side comparison of the actual subbasins, the classical method of drawing cluster results, and the proposed method’s mapping output obtained using MLxtend’s “decision_regions” function (Fig. 8).
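MLxtend exposes this functionality as plot_decision_regions in mlxtend.plotting; a minimal sketch using the fitted model and arrays from the earlier sketches could look as follows.

```python
# High-resolution final map: each region gets its own color and each cluster's
# points get a distinct marker symbol; slow to draw, so reserve it for export.
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions

fig, ax = plt.subplots(figsize=(8, 10))
plot_decision_regions(X, y, clf=model, ax=ax)   # model: fitted SVC, y: integer labels
ax.set_xlabel("Latitude")                       # feature 0 is plotted on the x-axis
ax.set_ylabel("Longitude")
fig.savefig("geoz_final_map.png", dpi=300)
```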

Fig. 8

Side-by-side comparison showing a the actual GW subbasin map, b the classical method of portraying clustering results, and c the proposed method’s mapping results obtained using MLxtend’s “decision_regions” function

Conclusions and Future Work

Our approach achieved a 99.1% accuracy in delineating the GW subbasins of California using a trained ML classification model employing data from the GWDB. SVM was the only ML model among the 13 tested models that fulfilled all the requirements for using it as a base model for the mapping library. We also highlighted the limitations that restrict the use of the model. Furthermore, we attempted to address the “Generalized Cluster” issue by creating a function called “bazel_cluster,” which has a high success rate in addressing the SVM limitation and provides clear signs when it fails. We implemented three mapping modules inside GeoZ to address the various expected use cases of the library. GeoZ has been made available in the PyPI, and its source code has been made available on GitHub (ElHaj 2023).

The library is being used in most of our ongoing work to delineate GW subbasins and aquifers. However, there is still room for improvement. One addition would be a third dimension to illustrate depth along with the lengths and widths of our areas of interest. Matplotlib, the backend drawing library used in GeoZ, includes 3D capabilities; therefore, this would be theoretically possible. However, the decision zone drawing libraries can only draw the model inference in two dimensions. As a result, creating a 3D representation of our clusters would require building a decision boundary/zone mapping library from the ground up or extending one of the established libraries used in GeoZ to accommodate 3D capabilities.

In addition to the geosciences, GeoZ can also be used in other fields that use unsupervised clustering to create decision zones. A good example of such a field is crime analysis, in which substantial research employs clustering algorithms to determine the degree of risk for each geographic region. However, the resulting maps typically color the data points based on their classification and display them as scatter points on the map, without establishing boundaries or zones for the regions. Even when boundaries are determined, they are mostly predefined regions based on administrative or natural boundaries, unlike GeoZ, wherein the boundaries are dynamic, clearly defined, and determined based on the clusters’ sample distribution. To the best of our knowledge, there are currently no other methods or libraries that achieve the capabilities of GeoZ. Therefore, we hope that our study will offer a significant contribution to the fields of GIS and ML.